In This Issue


This issue of Survey Methodology is dedicated to Gordon J. Brackstone, who recently retired from 
Statistics Canada. He was Assistant Chief Statistician for the Informatics and Methodology field and 
had been chairman of the Survey Methodology management board since 1987. His continuous support 
to the journal has been marked by great insight and motivated by a constant desire to foster high 
standards of methodology practices. Further, he also authored several articles that appeared in the 
journal. We wish to express our extreme gratitude to Gordon J. Brackstone. 

The current issue contains eight regular papers on a variety of topics, and three short 
communications. As mentioned in the previous issue of the journal, we are introducing a new Short 
Communications section in Survey Methodology. This section will contain shorter papers, typically 
around four pages. Possible topics of short communications include presentation of new ideas without 
the full development of a regular paper, brief reports of empirical work, and discussions or 
supplements related to other papers published in the journal. 

For the past four years the June issue of Survey Methodology has included an invited paper in 
honour of Joseph Waksberg. Starting this year, this annual invited paper will be published in the 
December issue of the journal, bringing it more in line with the associated Waksberg address delivered 
at Statistics Canada’s annual methodology symposium in the autumn. The author of this year’s 
Waksberg paper is J.N.K. Rao and his paper will be on the “Interplay Between Sample Survey Theory 
and Methods: an Appraisal”. 

In the opening paper of this issue, Winglee, Valliant and Scheuren present a new simulation 
approach to estimation of error rates for threshold selection in record linkage. For each potential 
matched pair there is a vector of comparison outcomes that determines the linkage weight. A 
multinomial model is assumed for each comparison outcome, with different multinomial distributions 
for true matches and true non-matches. The distributions are estimated from a sample, and then used 
to simulate the distributions of the linkage weights for true matches and true non-matches. The method 
is illustrated in a case study using data from the U.S. Medical Expenditure Panel Survey (MEPS). 

Krewski, Dewanji, Wang, Bartlett, Zielinski and Mallick investigate the effects of record linkage 
errors, both false positives and false negatives, on risk estimates in cohort studies. They show 
analytically how linkage errors introduce both bias and additional variability into observed and 
expected numbers of deaths, as well as into estimates of standardized mortality ratios and relative risk 
regression coefficients. They discuss their results in their conclusions, and point to further work that 
needs to be done in this area. 

The paper by van den Brakel and Renssen addresses the problem of testing hypotheses between 
different survey implementations, such as different questionnaire designs, when a complex sampling 
design is used. A design-based theory is developed for cases where the survey implementations are 
assigned to subsamples through completely randomized experimental designs or randomized block 
experimental designs. The theory also makes use of measurement error models. Design-based Wald
statistics are used to compare the different survey implementations.

Tsuchiya approaches the long-standing problem of asking respondents sensitive questions in an 
interesting fashion. Instead of using the randomized response approach that allows little control for 
the researcher, he proposes that the item count technique be adapted for sensitive questions. The item 
count technique presents the respondent with a list of several phrases, from which the respondent 
selects all that apply to him. The researcher constructs the list in two ways: the first list contains the 
sensitive phrase while the second list does not. Tsuchiya presents various estimators for this technique 
and gives an interesting example related to the Japanese national character. 

In the paper by DiZio, Guarnera and Luzi, finite mixture models are used to detect errors that are 
due to an incorrect unit of measurement at the collection stage of the survey. In a multivariate context 
and assuming that the data are multivariate normal, the procedure can identify which variables are in 
error for a given sampled unit. The authors also provide diagnostics for prioritizing cases to be 
investigated more deeply through clerical review. The proposed methodology is illustrated through an 
example with simulated data and an example with real data. 


Chiu, Yucel, Zanutto and Zaslavsky present a method for multiple imputation of missing contextual 
variables for use in regression analysis. For each record missing the variable, and for a sample of 
complete records, matched cases are selected based on a set of matching variables. The sample of 
complete records is then used to estimate a regression adjustment for other variables not included 
among the matching variables. The contextual variables for the incomplete records are then multiply 
imputed. The authors then show an application to a colorectal cancer study, and use simulations to 
compare their approach to three other nonresponse adjustment methods. 

Nandram and Choi examine the important problem of nonignorable nonresponse in small-area 
estimation of a health status variable. When confronted with an example where the usual estimators 
are biased because of the excessive number of nonrespondents, they attempt to account for the 
differences through modeling. Nandram and Choi use two nonignorable nonresponse hierarchical 
Bayes models, a selection model and a pattern model, to analyze the health data. An important 
consideration to their modeling is the incorporation of the input from doctors concerning the 
nonresponse pattern and the outcome variable. The results give an accurate non-response adjustment 
and a better measure of precision. 

Park and Fuller propose a method to reduce the probability of obtaining negative estimation 
weights when using a regression estimator. Their method consists of first approximating inclusion 
probabilities, conditional on Horvitz-Thompson estimates for a vector of auxiliary variables, and then 
using these approximate conditional inclusion probabilities as initial weights in a regression estimator. 
Their method is shown to work well in a simulation study. The weights obtained from this method are 
also compared to weights from quadratic programming, the raking ratio, the logit procedure and 
maximum likelihood. 

In the first of three short communications included in this issue, Andersson and Thorburn show that 
the optimal regression estimator can be expressed as a calibration estimator with an appropriately 
chosen distance function. The resulting optimal estimator is asymptotically more efficient than the 
usual Generalized Regression (GREG) estimator. A small simulation study illustrates several 
situations where the optimal estimator is significantly more efficient than the GREG estimator.

Lynn and Gabler extend the results of Gabler, Hader and Lahiri (volume 25, 1999) on Kish’s 
expression for the design effect due to clustering. They give a practical approach to estimating Kish’s 
quantity at the sample design stage when only the total numbers of observations and of clusters are 
needed. 

Meza and Lahiri examine the limitations of a standard regression model selection criterion, 
Mallows’ statistic, for nested error regression models. They show that, while a straightforward
application of Mallows’ statistic may result in inefficient model selection methods, a suitable
transformation of the data may be the answer.

Finally, we would like to inform you that Harold Mantel will now hold the new position of Deputy 
Editor. Harold has been part of the Editorial Board for the last 15 years. His dedication to the journal 
has been notable and his continuous involvement in the editorial process has been instrumental in 
ensuring that Survey Methodology remains a high quality publication. 


M.P. Singh 
A Case Study in Record Linkage


M. Winglee, R. Valliant and F. Scheuren


Abstract 


Record linkage is a process of pairing records from two files and trying to select the pairs that belong to the same entity. The 
basic framework uses a match weight to measure the likelihood of a correct match and a decision rule to assign record pairs 
as “true” or “false” match pairs. Weight thresholds for selecting a record pair as matched or unmatched depend on the 
desired control over linkage errors. Current methods to determine the selection thresholds and estimate linkage errors can 
provide divergent results, depending on the type of linkage error and the approach to linkage. This paper presents a case 
study that uses existing linkage methods to link record pairs but a new simulation approach (SimRate) to help determine 
selection thresholds and estimate linkage errors. SimRate uses the observed distribution of data in matched and unmatched 
pairs to generate a large simulated set of record pairs, assigns a match weight to each pair based on specified match rules, 
and uses the weight curves of the simulated pairs for error estimation. 


Key Words: File matching; Linkage error rates; Match weight; Selection threshold; Medical records. 


1. Introduction 


The basic record linkage framework by Newcombe,
Kennedy, Axford and James (1959) and Fellegi and Sunter
(1969) uses a match weight to measure the likelihood of a 
correct match and a decision rule to classify record pairs. 
The optimal decision rule uses two match weight thresholds 
for selection (an upper threshold above which a link is 
treated as a match and a lower threshold below which a link 
is treated as a nonmatch). The choice of these thresholds 
depends on the acceptable pre-set linkage error rate and the 
requirement to minimize the number of links with 
indeterminate status between the two thresholds. Nowadays, 
practitioners of computerized linkage systems often use a 
single selection threshold to avoid manual intervention of 
the indeterminate links. Linkage decisions are typically 
made automatically after the system is “tuned” to achieve 
pre-set error levels. The challenge is that current methods to 
determine the selection threshold and to estimate linkage 
errors can produce divergent results depending on the type 
of linkage error, the choice of comparison space, and the 
estimation method. 

This paper shares our experience with fellow practi- 
tioners who need a method to guide linkage selection and 
error estimation. Our case study used medical event files 
from the US Medical Expenditure Panel Survey (MEPS). 
MEPS collects medical expenditure data from both 
household respondents and their medical providers. The 
purpose is to combine the data from both sources for 
supporting annual estimations of medical utilization and 
expenditures (see Agency for Healthcare Research and 
Quality 2001 for more details on MEPS). 


Here we discuss the linkage with three sets of annual 
medical event files - MEPS 1996, MEPS 1997, and MEPS 
1998. Each set consisted of a household file containing 
events reported by household respondents for a given year 
and a medical provider file containing the corresponding 
events reported by medical providers of the household 
respondents. On average, approximately 50,000 medical 
events were reported for close to 10,000 persons, and 
around 15,000 person-provider units each year. 

We used two model-based alternatives for linkage error 
estimation. One of these uses simulation to develop a 
distribution of the weights for various levels of agreement. 
This technique, called SimRate, begins by generating 
weight distributions for matched and unmatched record 
pairs. Using these, SimRate can then provide estimates of 
linkage error rates for different threshold levels. The error 
rates can then be used as a guide to action and a way to 
measure success. SimRate is contrasted with a second 
modeling approach created by Belin and Rubin (1995). As 
we hope to show, there is a role for both approaches; each 
has strengths as illustrated in the comparisons. 


2. Mixture Models and SimRate Approaches


The mixture modeling method of linkage error esti- 
mation, as presented in Belin and Rubin (1995), has several 
attractive features. It is flexible in the sense that the weight
creation process does not have to be considered directly. 
Hence, this method can be applicable to many different 
ways of creating weights. Once a model is specified, error 
rates can be examined for a continuum of potential threshold
values and confidence bands can be constructed to monitor 
the precision of error estimates (see section 7). 

Mixture modeling does have limitations. The method
provides one particular kind of error rate, the proportion
of linked records that are actually unmatched pairs, but
overall false positive and false negative error rates cannot be
estimated since nonlinked pairs are not considered. The
error rate that is estimated is conditional on the set of linked
pairs of records. Furthermore, model parameters may be
hard to estimate if the weight distributions for the matched 
and unmatched sets are not separable (see Winkler 1994). 

A key assumption in the Belin-Rubin approach is that it
is possible to transform the distributions of the weights in
the matched and unmatched sets to make them normal. A
real difficulty here is that the transformed weights
may be far from normal when the weight distribution for
either the matched or unmatched sets is multimodal.

Another critical requirement is to have a training data set 
whose characteristics are very similar to those that are to be 
matched. Without a good training data set, the input para- 
meter estimates for the mixture model may be poor, 
affecting the final estimated error rates obtained. Based on 
our application using annual medical event data repeated 
over three years, the parameters were not stable over time. 
This instability necessitated a training set for each year, 
making the Belin-Rubin approach impractical in our
application because of the cost and time it required.

The simulation approach, SimRate, like mixture 
modeling, has the ability to examine different thresholds, 
allowing the user to monitor both the sensitivity and 
specificity of the decision rule for selecting linked pairs. As 
long as the process used to create match weights can be 
realistically modeled, customized methods of weight 
assignment like the one used in the current case study can be 
accommodated. The method does require the generation of 
pairs of records using the distribution of characteristics for 
the matched and unmatched sets. Some effort is needed to 
realistically generate the populations of pairs. In our work 
we have been successful with multinomial models for 
generating these populations. 


3. Threshold Weight and Linkage Error 
Estimation 


Several methods are available in the literature for 
selecting true matches and for estimating linkage errors 
(e.g., Bartlett, Krewski, Wang and Zielinski 1993, 
Armstrong and Mayda 1993, Belin 1993, Belin and Rubin 
1995 and Winkler 1992, 1995). See Fellegi (1997) for an 
overview of evolutions in record linkage, Tepping (1968) 
and Larsen and Rubin (2001) for other linking methods, and 
Scheuren (1983) for a capture-recapture method to estimate
omission error. 

Comparison of estimates from the different approaches is 
complicated by the fact that each approach tends to focus on 
different error components. In fact, the methods used in the 
linkage literature to construct linkage error rates are some- 
what inconsistent. We illustrate this problem below. 

Table 1 shows a 2×2 contingency table tabulating the
numbers of true matched and unmatched pairs and declared
linked and nonlinked pairs selected by linkage systems.
Estimates of linkage error rates can be constructed relative
to the true totals shown in the columns. An estimate of false
positive linkage error rate under the Fellegi and Sunter
framework is $\hat{\mu} = P(A_1 \mid U) = n_{12}/n_{\cdot 2}$, and that of false
negative linkage error rate is $\hat{\lambda} = P(A_3 \mid M) = n_{21}/n_{\cdot 1}$ (see
also Armstrong and Mayda 1993). These are the rates that
SimRate is designed to estimate. They answer the question:
“Of the set of true matched (or unmatched) pairs, what
proportion is not correctly identified?”


Table 1
A Contingency Table for Evaluating Linkage Errors

                                     True set
Declared set      Match (M)                    Unmatch (U)                  Declared total
Link ($A_1$)      $n_{11}$ (true positive)     $n_{12}$ (false positive)    $n_{1\cdot}$
Nonlink ($A_3$)   $n_{21}$ (false negative)    $n_{22}$ (true negative)     $n_{2\cdot}$
True total        $n_{\cdot 1}$                $n_{\cdot 2}$                $n_{\cdot\cdot}$


Some linkage evaluations have also considered rates
relative to the declared totals in the rows. For instance,
Gomatam, Carter, Ariet and Mitchell (2002) used $n_{11}/n_{1\cdot}$
and labeled it the positive predictive power of the linkage
system. Others, however, have labeled this as the false
match rate (Belin and Rubin 1995) or false positive declared
rate (Bartlett et al. 1993). Rates constructed in this manner
answer the question: “Of the declared linked (or nonlinked)
pairs, what proportions are wrong?” Both questions are
important in selecting matched pairs and should be
addressed. That is one of the appeals of employing both
SimRate and Belin-Rubin, if possible.
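The two families of rates discussed above can be computed side by side from the four cells of Table 1. The following is a minimal sketch only: the cell names n11, n12, n21 and n22 (rows for declared status, columns for true status) and the counts are assumptions made for illustration, not data from the case study.

```python
from dataclasses import dataclass


@dataclass
class LinkageTable:
    """Counts from a 2x2 table of declared status by true status."""
    n11: int  # declared link, true match (true positive)
    n12: int  # declared link, true unmatch (false positive)
    n21: int  # declared nonlink, true match (false negative)
    n22: int  # declared nonlink, true unmatch (true negative)

    def false_positive_rate(self) -> float:
        # Column-based: relative to the total of true unmatched pairs.
        return self.n12 / (self.n12 + self.n22)

    def false_negative_rate(self) -> float:
        # Column-based: relative to the total of true matched pairs.
        return self.n21 / (self.n11 + self.n21)

    def positive_predictive_value(self) -> float:
        # Row-based: relative to the total of declared links.
        return self.n11 / (self.n11 + self.n12)

    def false_link_share(self) -> float:
        # Row-based: share of declared links that are truly unmatched.
        return self.n12 / (self.n11 + self.n12)


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
table = LinkageTable(n11=900, n12=50, n21=100, n22=8950)
print("false positive rate:", table.false_positive_rate())
print("false negative rate:", table.false_negative_rate())
print("positive predictive value:", table.positive_predictive_value())
print("share of declared links that are false:", table.false_link_share())
```

The column-based and row-based rates share numerators but not denominators, which is why estimates from the two perspectives need not agree.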


4. SimRate Weight Distribution
Methods to Estimate Linkage Error


How to best estimate the linkage errors, given a limited 
budget and time schedule, is a difficult question. Accurate 
estimation of linkage errors should depend on at least two 
factors — the power of the identifying fields to unambi- 
guously identify events that are true matches and the linkage 
method used. Taken together it is then possible, in a given 
setting, to specify linkage categories, estimate agreement 
probabilities, and determine match weights. 
Following Newcombe and Kennedy (1962) and Jaro
(1989), we adopt a weight distribution approach in our 
application that can take all these factors into consideration. 
The basic step is to first compute the match weight and 
order all possible configurations of agreement and dis- 
agreement outcomes of the comparison fields by match 
weight. Then we plot the cumulative distribution function of 
the weights for matched and unmatched pairs, and use the 
resulting weight chart to determine thresholds to attain 
desired levels of false positive and false negative error rates. 

An ideal method to develop these curves might be to 
begin with a set of record pairs for which the truth is known. 
If resources are available, we could use a large set of true 
matched pairs, order them by match weight, and observe 
what proportion is above or below a given threshold. 
Similarly, we could take a large set of pairs, known to be 
true unmatched pairs, order them by weights, and again 
tabulate the proportion on either side of the threshold. The 
proportion of true matched pairs with weights below the 
threshold and the proportion of true unmatched pairs with 
weights above the threshold would then be estimates of the 
error rates associated with the way in which the matching 
algorithm is implemented. 
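When the true status of a set of pairs is known, the tabulation described above reduces to counting weights on either side of a threshold. The sketch below assumes that pairs with a weight at or above the threshold are declared links; the handling of ties and the weights themselves are hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(matched_weights: Sequence[float],
                unmatched_weights: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false negative rate, false positive rate) at a threshold.

    Pairs with weight >= threshold are treated as declared links; the
    treatment of ties at the threshold is an assumption of this sketch.
    """
    fn = sum(w < threshold for w in matched_weights) / len(matched_weights)
    fp = sum(w >= threshold for w in unmatched_weights) / len(unmatched_weights)
    return fn, fp


# Hypothetical weights for pairs whose true status is known.
matched = [9.5, 7.0, 4.2, 1.5, -0.5]
unmatched = [-6.0, -4.5, -2.0, 0.5, 2.5]

for t in (-1.0, 1.0, 3.0):
    fn, fp = error_rates(matched, unmatched, t)
    print(f"threshold={t:+.1f}  false negative={fn:.2f}  false positive={fp:.2f}")
```

Raising the threshold trades false positives for false negatives, which is the tradeoff the weight chart is meant to display.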

One method to approximate this “ideal” approach (see 
also Bartlett et al. 1993) is to sample record pairs and use 
manual review to determine the true match status. Once the 
true pairs are known, we can attach the match weights from 
whatever linkage system is being used and then develop 
cumulative weight distributions, as discussed above. This 
method is, of course, subject to the well-known time and 
other resource limitations of manual review and is seldom 
practical with a large sample. 

An alternative method is to generate the cumulative 
weight distributions through simulation. That is the heart of 
the SimRate approach. To explain in some detail, denote a
record pair by $r$ and a comparison field by
$v$ ($v = 1, \ldots, V$ fields). The comparison outcome situations in
our application included partial agreements and multiple
outcome categories beyond the basic agreement and
disagreement categories (see also Newcombe 1988). Therefore,
we denote that each field $v$ has $i = 1, \ldots, c_v$ outcome
categories. The outcome indicator is $y_{rv} = (y_{rv1}, \ldots, y_{rvc_v})$,
a vector of indicators showing the category into which pair $r$
falls. One of the values of $y_{rvi}$ will be 1 and the others 0 for
each field.

The particular theory supporting the SimRate approach is
to assume that $y_{rv}$ has one multinomial distribution if pair
$r$ is a matched pair and a different multinomial distribution
if it is an unmatched pair. We can then model the $y_{rv}$
vectors as having a multinomial distribution with parameters
$m_v = (m_{v1}, \ldots, m_{vc_v})$ if the pair is a matched pair
and parameters $u_v = (u_{v1}, \ldots, u_{vc_v})$ if the pair is an
unmatched pair. Then the probability $m_{vi} = P$(field $v$
category $i$ agrees in pair $r \mid r \in M$) is the conditional
probability of agreement for field $v$ category $i$, given that the
record pair $r$ is in the set $M$ of true matched pairs. In
contrast, the probability $u_{vi} = P$(field $v$ category $i$ agrees in
pair $r \mid r \in U$) is the conditional probability of agreement
for field $v$ category $i$, given that the record pair $r$ is in the set
$U$ of true unmatched pairs. Assuming independence of the
matching variables, $v = 1, \ldots, V$, we can specify the joint
probability of $y_r = (y_{r1}, \ldots, y_{rV})$ if pair $r$ is a match, as

$$P(y_r \mid r \in M) = \prod_{v=1}^{V} \prod_{i=1}^{c_v} m_{vi}^{y_{rvi}}.$$

The corresponding probability of the same configuration of
data, if the pair is really an unmatched pair, is

$$P(y_r \mid r \in U) = \prod_{v=1}^{V} \prod_{i=1}^{c_v} u_{vi}^{y_{rvi}}.$$

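Because each indicator vector contains a single 1, the double product over fields and categories simply picks out one probability per field. The sketch below illustrates this for two fields; all probabilities are hypothetical and the function name is introduced here for illustration only.

```python
import math

# Hypothetical category probabilities for two comparison fields.
# Each inner list sums to 1 (one multinomial distribution per field).
m = [[0.70, 0.20, 0.10],   # field 1, matched pairs
     [0.90, 0.10]]         # field 2, matched pairs
u = [[0.05, 0.15, 0.80],   # field 1, unmatched pairs
     [0.30, 0.70]]         # field 2, unmatched pairs


def joint_probability(probs, categories):
    """Probability of one configuration, assuming independent fields.

    categories[v] is the index of the outcome category observed for
    field v, i.e., the position of the single 1 in the indicator vector.
    """
    return math.prod(probs[v][i] for v, i in enumerate(categories))


config = [0, 1]  # field 1 falls in category 0, field 2 in category 1
p_match = joint_probability(m, config)    # 0.70 * 0.10
p_unmatch = joint_probability(u, config)  # 0.05 * 0.70
print(p_match, p_unmatch)
```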
SimRate uses Monte Carlo simulation methods to
generate a large number of realizations of matched pairs and
unmatched pairs using estimates of the probabilities $m_{vi}$ and
$u_{vi}$. For each simulated pair, a match weight $w_r$, which
applies to a given configuration of data, is calculated. For a
given realization $y_r$, a weight $w_r$ is computed for the pair
by summing the weights for the randomly generated
categories that the pair fell into. The match weight $w_r$ of a
record pair is typically estimated as


$$w_r = \sum_{v=1}^{V} \sum_{i=1}^{c_v} y_{rvi} \log\left(\frac{\hat{m}_{vi}}{\hat{u}_{vi}}\right).$$


See section 6 on the match weights used in our simulation. 

The cumulative distribution of these weights for the 
simulated matched pairs is then plotted as “Sim-M”. 
Similarly, the reverse cumulative distribution for the 
unmatched pairs is plotted to generate “Sim-U” (see Figure
1, section 8, for an example of the simulation curves used in
this study). The simulated proportion of matched pairs
whose weights are below the cutoff is the estimate of the
false negative error rate. The simulated proportion of
unmatched pairs whose weights are above the cutoff is the 
estimate of the false positive error rate. 
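The simulation steps described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: the probabilities are hypothetical, the category weights are taken to be base-2 logarithms of the m/u ratios, and a pair whose weight equals the cutoff is counted as linked.

```python
import math
import random

# Hypothetical estimated category probabilities for two comparison fields.
m_hat = [[0.70, 0.20, 0.10], [0.90, 0.10]]   # matched pairs
u_hat = [[0.05, 0.15, 0.80], [0.30, 0.70]]   # unmatched pairs

# Assumed weight rule: log base 2 of the m/u ratio for each category.
category_weights = [[math.log2(mi / ui) for mi, ui in zip(mv, uv)]
                    for mv, uv in zip(m_hat, u_hat)]


def simulate_weights(probs, n_pairs, rng):
    """Draw one category per field for each pair and sum the category weights."""
    totals = []
    for _ in range(n_pairs):
        total = 0.0
        for v, field_probs in enumerate(probs):
            # One multinomial draw: pick the category index for field v.
            i = rng.choices(range(len(field_probs)), weights=field_probs)[0]
            total += category_weights[v][i]
        totals.append(total)
    return totals


rng = random.Random(2005)
sim_m = simulate_weights(m_hat, 10_000, rng)  # simulated matched pairs
sim_u = simulate_weights(u_hat, 10_000, rng)  # simulated unmatched pairs

cutoff = 0.0
fn_rate = sum(w < cutoff for w in sim_m) / len(sim_m)
fp_rate = sum(w >= cutoff for w in sim_u) / len(sim_u)
print(f"false negative ~ {fn_rate:.3f}, false positive ~ {fp_rate:.3f}")
```

With these hypothetical inputs the exact rates are 0.12 and 0.095, so the simulated proportions should fall close to those values. Repeating the last three lines over a grid of cutoffs traces out the two weight curves.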

This approach requires that empirical estimates be made 
of the distributions among the matching variables of both 
true matched and true unmatched pairs. Even though the 
weight algorithm may involve the assumption of inde- 
pendence among matching variables, the actual data may 
show dependence. As long as artificial pairs can be gene- 
rated that realistically follow the observed distribution of the 
data (incorporating any dependencies), then this method 
should provide suitable error rate estimates. 
In our case study, we modeled data fields as having inde-
pendent multinomial distributions, but this may not be 
reasonable in other applications. The SimRate concept can 
apply to any algorithm where weights and a cutoff point are 
used for classification. Thus, methods other than Fellegi and 
Sunter (1969), like Belin and Rubin (1995), might also be 
evaluated in this way. If methods are needed to deal with 
dependent categorical variables, the multivariate multi- 
nomial distributions in Johnson, Kotz, and Balakrishnan 
(1997, Chapter 26) may be appropriate. However, in appli- 
cations similar to ours, the simplest procedure for 
accounting for dependence is to form cross-classifications of 
the variables that are related and to estimate probabilities for 
each cell in a cross-table. For example, if two variables with
$c_1$ and $c_2$ categories are associated, then we can estimate
the joint probability, $p_{ij}$, for each cell in the $c_1 \times c_2$ table
and use those in the simulation. Sparse data will naturally
limit the number of cells for which this is feasible. But in the 
presence of sparse data, the penalty for model failure must 
be small. 
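The cross-classification idea can be illustrated as follows. The observed outcomes are hypothetical; the point of the sketch is only that drawing whole cells of the cross-table, rather than each variable separately, preserves the association between the two variables.

```python
import random
from collections import Counter

# Hypothetical observed outcomes for two associated fields across pairs:
# each tuple is (category of field A, category of field B).
observed = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 5 + [(1, 1)] * 45

counts = Counter(observed)
n = len(observed)
# Estimated joint probability for every cell of the cross-table.
joint = {cell: count / n for cell, count in counts.items()}

# Product of the two margins, for contrast with the joint estimate.
p_a0 = sum(p for (a, _), p in joint.items() if a == 0)
p_b0 = sum(p for (_, b), p in joint.items() if b == 0)
print("joint estimate for cell (0, 0):", joint[(0, 0)])
print("value implied by independence:", p_a0 * p_b0)

# Simulate by drawing whole cells, which keeps the dependence intact.
rng = random.Random(1)
cells = list(joint)
draws = rng.choices(cells, weights=[joint[c] for c in cells], k=10_000)
share_00 = draws.count((0, 0)) / len(draws)
print("simulated share of cell (0, 0):", share_00)
```

In this toy example the joint estimate for the first cell is 0.40 while independence would imply 0.225, so simulating the two fields separately would misstate how often the categories occur together.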


5. Record Linkage of MEPS Medical 
Events 


Record linkage of MEPS medical events used five identi- 
fying fields: event dates (year, month, day, and day-of- 
week), medical condition codes, procedure codes, global-fee 
codes, and lengths (number of days) of hospital stay. These 
fields are described in more detail in Winglee, Valliant, 
Brick and Machlin (2000). A training sample from MEPS 
1996 was employed to derive match rules and outcome cate- 
gories and to estimate the probabilities of agreement for 
each category, allowing for partial agreement and value 
specific outcomes. The same match rules were repeated 
each year with minor adjustments of the matching para- 
meters. 

For the training set we used the linkage system Auto- 
match (Matchware 1996) and the unique match algorithm to 
select linked pairs. In “unique” matching, a File A record is 
optimally linked to only one File B record (Jaro 1989). In 
addition, we used the many-to-many match algorithm to 
generate a random sample of nonlinked pairs to facilitate 
linkage error estimation. However, the methods for esti- 
mating error rates, described below, apply to any software 
that implements the linkage methods based on match 
weights. They are not specific to Automatch. 

The tradeoff in determining the selection threshold for 
MEPS was between getting a high match rate and limiting 
mismatch linkage errors. A high threshold weight would 
minimize false positive (mismatch) errors at the expense of 
lowering the match rate and losing valuable data collected 
from medical providers. On the other hand, a low threshold 
would increase false positive error and may affect the
allocation of expenditure data in a way that would require 
special analytic techniques to overcome and even then only 
with uncertainty. Since both data sources had reported on 
ostensibly the same medical events for the same persons 
over the same period, the strategy was to maintain a 
reasonably high match rate and to conduct a manual review 
of a limited number of questionable linked pairs after 
selection to assess the analytic impact of falsely accepting 
them. Based on this decision the average match rate for the 
annual MEPS medical records files was about 85 percent. 

The 1996 MEPS training sample M curve, labeled the 
“Tra-M” curve, was generated by applying match weights
to “true” matched pairs for a random sample of 500 persons 
in MEPS 1996. For these persons, the manual review files 
contained 2,507 events from household respondents and 
2,804 events from medical providers. Knowledgeable data 
managers reviewed the events and selected 1,501 pairs. We 
considered these as the true matched pairs in this evaluation. 
The manually matched pairs were assigned the weights 
derived from our match specification to generate a cumu- 
lative distribution function. 

The 1996 training sample U curve, labeled the “Tra-U” 
curve, was generated using a random sample of unmatched 
pairs. We used a simple random sampling with replacement 
method to select 500 events each from the matching files 
and employed a many-to-many match algorithm to generate 
all 250,000 possible event pairs. For these randomly 
selected sets of pairs, the chance of there being any correctly 
matched pairs is negligible; thus, the entire set was taken to 
consist of unmatched pairs. We applied the match weights 
from our matching specification and plotted the “Tra-U” 
curve equal to 1 minus the cumulative distribution of the 
weights of these pairs. Figure 1 in section 8 shows both the 
Tra-M and Tra-U curves for the 1996 MEPS. The curves
shown in this figure were smoothed using a nonparametric
lowess function (Chambers, Cleveland, Kleiner and Tukey
1983) in S-PLUS 2000 (1999).
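The construction of the unmatched-pair curve can be sketched as a many-to-many pairing followed by a reverse cumulative distribution. In the sketch below the two samples are stand-in integer identifiers and the weights are hypothetical; the smoothing step is omitted.

```python
import bisect
import itertools


def reverse_cdf(weights, grid):
    """Return 1 minus the empirical cumulative distribution at each grid point."""
    ordered = sorted(weights)
    n = len(ordered)
    return [1.0 - bisect.bisect_right(ordered, t) / n for t in grid]


# Stand-ins for 500 sampled events from each matching file.
file_a = range(500)
file_b = range(500)

# Many-to-many pairing: every sampled File A event with every File B event.
pairs = list(itertools.product(file_a, file_b))
print(len(pairs))  # 500 * 500 event pairs

# Hypothetical weights for a handful of unmatched pairs.
toy_weights = [-6.0, -4.5, -2.0, 0.5, 2.5]
curve = reverse_cdf(toy_weights, [-5.0, 0.0, 3.0])
print(curve)
```

Each point on the curve is the proportion of unmatched pairs whose weight exceeds the corresponding grid value.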


6. SimRate Implementation in MEPS


The SimRate weight distribution method used Monte 
Carlo simulation methods to generate separate sets of 
10,000 simulated matched and unmatched pairs for creating 
the weight curves. To generate the “Sim-M” weight 
distributions we estimated the probabilities $m_{vi}$ from linked
pairs assigned by a unique matching algorithm. We used the
“tuned” linkage system to select matched pairs from the
1996 annual matching files and tabulated the observed
frequencies for each outcome category for each of the five
matching fields. The proportion of pairs that fell into
category $i$ of field $v$ was then used as the estimate $\hat{m}_{vi}$ of the
probability $m_{vi}$.
For the “Sim-U” curve, the $u_{vi}$ probabilities for
unmatched pairs were estimated using the
same sample of unmatched pairs used in creating the
“Tra-U” curve. The difference is that we used these pairs to
observe the relative frequencies for each outcome category
for each of the five matching fields among unmatched pairs.
The proportion of pairs that fell into category $i$ of field $v$ was
then used as the estimate $\hat{u}_{vi}$ of the probability $u_{vi}$.

For a simulated matched pair, a realization of the
multinomial random variable $y_{rv}$ was generated for each
match field. For example, a configuration like (agreement
on event date, agreement on length of hospital stay,
agreement on the array of condition codes, joint agreement
by type of procedure, and value specific agreement for a
global-fee indicator) was generated using the match
probabilities $\hat{m}_{vi}$ for each outcome category. Similarly, for
each unmatched pair, a realization was generated of a
category for each of the five fields using the unmatched
probabilities $\hat{u}_{vi}$ discussed above.

For a given realization $y_r$, a weight $w_r$ was computed
for the pair by summing the weights for the randomly 
generated categories that the pair fell into. The actual 
weights used in our simulation were adjusted ones that we 
specified rather than ones defined directly by the matching 
software (see Winglee, et al. 2000). Thus, we are simulating 
the way in which matching would actually be implemented. 
To do this we calculated the match weight for both the 
matched and unmatched sets of 10,000 pairs and plotted the 
simulated match weight functions. 

Table 2 shows examples of some of the partial agreement
categories used for matching event date and the estimates
$\hat{m}_{vi}$, $\hat{u}_{vi}$, and $w_{vi}$ used in the SimRate simulation. We defined a
total of 19 outcome categories for matching by event date, 9
categories for duration of hospital stay, 27 categories by
medical procedures, and 3 categories each for medical
conditions and global fee. For example, for the outcome
category exact agreement on event date, the estimate of $m_{vi}$
was 0.69, meaning that 69 percent of the linked pairs had
exact agreement on event date. The estimate of $u_{vi}$ for this
outcome category was 0.003, showing that only 0.3 percent
of the unlinked pairs showed agreement on this field. The
match weight for exact agreement on date of event was 8.52
and that for complete disagreement (a difference of more than
two weeks and on a different day of the week) was -6.64
(see Winglee et al. 2000 for the match weights by match
fields and outcome categories).

We selected the match fields that were approximately 
independent in this case study. For example, we found no 
functional association between the date of medical events 
and other match fields like medical condition and length of 
hospital stay. For fields such as the indicators for surgery, 
radiology, and laboratory procedures, we used chi-square 
tests and found some dependence between the concurrence 
of surgery and radiology. To handle this situation, we 
estimated the joint probabilities and specified match rules to 
treat these procedure flags as a single match field (see 
section 4 above). Hence, we could then apply the 
independent multinomial distribution for simulation. 


Table 2 
Estimates of Multinomial Probabilities for Matched Pairs 
($\hat{m}_{vi}$) and Unmatched Pairs ($\hat{u}_{vi}$), and Match Weights ($w_{vi}$) 
for the Match Field Event Date 

Match rule for Event Date      $\hat{m}_{vi}$    $\hat{u}_{vi}$    $w_{vi}$
Missing                        0.031    0.046    0.00 
Exact match                    0.693    0.003    8.52 
Off +/- 1 day                  0.068    0.006    [illegible] 
Off +/- 3 day                  0.023    0.005    4.09 
Off +/- 5 day                  0.014    0.005    [illegible] 
Off +/- 7 day                  0.030    0.006    2.84 
Match by day of week only      0.014    0.034    -3.64 
Disagree                       0.003    0.547    -6.64 


Table 3 shows the results of linkage error estimates from 
SimRate and the training curves at the threshold weight of 
$w = 1$ for MEPS 1996, MEPS 1997, and MEPS 1998. 
SimRate was easy to repeat each year. Repeating the 
manual-based weight curves, however, depended in part on 
manual review and we had only one reliable training 
sample, that for 1996. Note that the linked pairs used in 
SimRate will naturally include some percentage of false 
positives and false negatives, i.e., some matched and 
unmatched pairs are incorrect. Thus, the $m_{vi}$ probabilities 
computed in this way for the identified fields are subject to 
error. It would have been preferable to estimate the $m$ 
probabilities from a "truth" set where we were confident 
that all matches were correct. However, the manually 
matched training sets we were able to produce were too 
small to yield stable estimates in all of the detailed match 
categories and manual selection is also imperfect. This 
difference may explain in part the slightly higher overall 
error rate estimates from SimRate than from the training 
sample weight curves. 


Table 3 
Weight Curve Methods to Estimate Linkage Error Rates at 
Threshold Weight 1, MEPS 1996-1998 

Method                       Error Rate          1996           1997           1998
SimRate simulation curves    False negative      [illegible]    [illegible]    5.6 
                             False positive      [illegible]    [illegible]    7.6 
Training sample curves       False negative*     [illegible]    [illegible]    [illegible] 
                             False positive**    [illegible]    [illegible]    [illegible] 

* Estimates from the 1996 Tra-M curve were used for all three years. 

** Estimates from the 1996 Tra-U curve used samples of 500 records 
from each match file and a total of 250,000 unmatched pairs. The 
1997 and 1998 estimates used different Tra-U curves employing 
samples of 1,000 records from each match file and a total of 
1,000,000 unmatched pairs. 

7. Mixture Model Implementation 
in MEPS 


A mixture modeling approach by Belin and Rubin (1995) 
views the distribution of observed match weights from a 
computerized linkage system as a mixture of weights for 
true matches and false matches. In principle, the mixture 
model method has two attractive features suitable for 
MEPS. First, it can handle repeated applications efficiently. 
When global parameter estimates of the transformed para- 
meters and the ratio of the variances of the two distributions 
are available, these estimates can be applied to similar data 
for estimation. Since the MEPS record linkage is done 
annually, global estimates derived from early training 
samples could conceivably be applied for linkage error esti- 
mation in later years when manual review samples were not 
available. 

The second advantage is that the mixture model can draw 
from multiple sets of parameter estimates from different 
training samples and can reflect variations. This feature is 
especially appealing for MEPS because manual review is a 
complex process and not necessarily always accurate. 
Hence, an alternative is to view the computer system 
selection as the truth and use it to provide an alternative 
set of parameter estimates. This process can also be repeated 
using training samples from more than one year. 

Our application of the Belin-Rubin approach used the 
same training samples from MEPS 1996 and a second 
training sample of the same size from 1997. Following 
Belin-Rubin's examples, we applied the mixture modeling 
method using manually identified true and false match pairs 
from a one-to-one matching system (note that such systems 
provide relatively few false match pairs for estimation). We 
computed model estimates for MEPS 1996 and MEPS 1997 
assuming the manual selection to be the truth, and for 
testing the behavior of the model, we computed a second set 
of estimates assuming computer system selected match pairs 
to be the true pairs. 

Implementation involved two procedures: the Box and 
Cox (1964) procedure for global parameter estimation and 
the Calibrate procedure (Belin and Rubin 1995) to fit a 
mixture model for error rate estimation. Before applying 
Box-Cox, the weights were rescaled between 1 and 1,000. 
The Box-Cox transformation discussed by Belin and Rubin 
(1995) was 

$$\psi(w_r; \gamma, \omega) = \frac{w_r^{\gamma} - 1}{\gamma\,\omega^{\gamma - 1}},$$

where $w_r$ is the match weight for pair $r$, $\omega$ is the geometric 
mean of the $w_r$ weights, and $\gamma$ is a parameter that is 
dependent on whether the pair is in the matched or 
unmatched set. 
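
As a rough sketch of this step, the code below rescales a set of weights to the interval from 1 to 1,000 and applies a Box-Cox transformation of the form $(w^{\gamma} - 1) / (\gamma\,\omega^{\gamma - 1})$, with $\omega$ the geometric mean of the weights. The linear rescaling and the function names are assumptions made for illustration; the text does not say how the rescaling was done, and this is not the Calibrate procedure itself.

```python
import math

def rescale(weights, lo=1.0, hi=1000.0):
    """Linearly map weights onto [lo, hi] (assumed rescaling method)."""
    w_min, w_max = min(weights), max(weights)
    return [lo + (w - w_min) * (hi - lo) / (w_max - w_min) for w in weights]

def geometric_mean(weights):
    """Geometric mean of strictly positive weights."""
    return math.exp(sum(math.log(w) for w in weights) / len(weights))

def box_cox(weights, gamma, omega=None):
    """Box-Cox transform scaled by the geometric mean omega.

    For gamma == 0 the limiting form omega * log(w) is used."""
    if omega is None:
        omega = geometric_mean(weights)
    if gamma == 0:
        return [omega * math.log(w) for w in weights]
    return [(w ** gamma - 1.0) / (gamma * omega ** (gamma - 1.0)) for w in weights]

raw = [-6.64, 0.0, 2.84, 8.52]          # illustrative match weights
scaled = rescale(raw)                   # now between 1 and 1,000
transformed = box_cox(scaled, gamma=0.48)
```

Rescaling to positive values first matters because both the geometric mean and non-integer powers require strictly positive weights.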

For the mixture model procedure to be effective, the 
transformed weights should be approximately normally 
distributed. The untransformed weight distribution with our 
data showed bimodality and almost no overlap in match 
weight between matched and unmatched pairs (bimodality 
was also observed in Belin and Rubin 1995). For example, 
application of their transformation procedure to the 1996 
MEPS system pairs resulted in parameter estimates of 
$\omega = 585.7$ and $\gamma = 1.15$ for the true matched pairs and 
$\omega = 113.1$ and $\gamma = 0.48$ for the false matched pairs. The 
transformed weights, however, showed relatively little 
improvement towards normality. Since the match weights 
are the log of a product, or the sum of logs, we might hope 
that the weights would be normally distributed if there were 
many components in the sum. However, we had only five 
fields to use for matching. The small number of fields may 
have accounted in part for the lack of normality with our 
transformed data. 

Table 4 shows the results of applying the Belin-Rubin 
mixture model to MEPS 1996. This table shows the model 
estimated false match rates, the 95 percent confidence 
interval of the estimated rate, and the actual observed false 
match rate at the threshold weight of 1. Using the manual 
review pairs as the true matched pairs, the model estimate of 
the expected false match rate at the threshold of w=1 was 
9.1 percent, with a 95 percent confidence interval ranging 
between 6.0 and 12.2. The actual observed false match error 
rate, however, was 14.5 percent, which is higher than the 
upper 95 percent confidence bound. Note that these are rates 
of the form 7n,,/n, in Table 1. These are not the same rates 
estimated by SimRate and the weight curve approach. 


Table 4 
Mixture Model Linkage Error Estimates 

                  Percentage false match error 
MEPS 1996         Expected rate    Lower bound*    Upper bound*    Observed rate
Manual match      9.1              6.0             12.2            14.5 
System match      0.9              0.6             1.2             0.0 

* The lower and upper bounds are the 95 percent confidence interval 
of the expected error rate. 

Since manual review may not always be accurate, an 
option, for the purpose of evaluation, is to treat the computer 
system linked pairs as the true matched pairs, and use them 
for modeling. Under this assumption, the model estimate of 
the expected error rate is 0.9 percent, with a 95 percent confidence 
interval between 0.6 and 1.2. The actual observed rate in 
this case, 0 percent, was a hypothetical outcome treating the 
computer-linked pairs as correct. Of course, in reality there 
will be some nonzero level of error so that the mixture 
model confidence interval is not necessarily wrong. 

We generated global parameter estimates using both the 
training sample manual selections and system selections for 
MEPS 1996 and MEPS 1997 and used them as four sets of 
inputs to provide global estimates for modeling linkage 
error for MEPS 1998. This should be possible because the 
data remained similar and record pairs were selected using 
the same match rules for all 3 years. A difference was that 
manual review was not conducted for MEPS 1998 and we 
could not use the Box-Cox procedure for global parameter 
estimation for 1998 (because there was no separate manual 
indicator for true and false pairs). For this application, we 
used a bootstrap method in the Belin and Rubin Calibrate 
procedure to draw from multiple parameter sets to reflect 
uncertainties in estimation. This application, however, did 
not converge after 150 iterations of the estimation procedure. 
We could only conclude that the global parameter 
estimates from earlier training samples failed to generalize 
and provide error rate estimates for repeated linkage 
applications. 


8. Concluding Comments and Analytic 
Implications 


The process of threshold selection and linkage error 
estimation is an iterative process involving repeated cycles 
of observation, estimation, and modeling. Our case study 
employed modeling approaches for estimating linkage 
errors and for monitoring the predictive power of the 
linkage system. Both methods provided valuable information 
for determining the linkage selection and for evaluating 
the quality of the declared matched pairs as we found in 
MEPS. 

The weight curves approach of estimation has the appeal 
that one can choose a selection threshold to attain the 
acceptable linkage error level. For example, Figure 1 shows 
the training sample and the SimRate simulation weight 
curves based on the 1996 MEPS matching files. A vertical 
line is drawn at the selection threshold weight of w=1; the 
error levels for 1996 MEPS (shown in Table 3) were then 
estimated by the cumulative percentage at threshold level. 
By sliding this threshold, one can aim to minimize the total 
linkage error by selecting a threshold at the crossing point of 
the M and U curves. In this case study, the optimal threshold 
suggested by both sets of weight curves is fairly consistent. 
We included a likelihood ratio scale in this figure to provide 
a rough likelihood interpretation of the match weight. For 
example, at the match weight of w=1, the likelihood ratio 
score is 2. This means that for records with a match weight 
of w=1 or above, the relative likelihood of being true pairs 
is at least 2 to 1. 
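
The two ideas in this paragraph can be illustrated with a short sketch. It assumes that the match weight is a base-2 logarithm of a likelihood ratio, which is consistent with a weight of 1 corresponding to a ratio of 2 but is not stated explicitly here, and it locates the crossing point of the two curves as the candidate threshold where the false negative and false positive rates are closest. The function names and the toy weights are hypothetical.

```python
def likelihood_ratio(weight):
    """Likelihood ratio implied by a weight, assuming base-2 log weights."""
    return 2.0 ** weight

def crossing_threshold(matched, unmatched, candidates):
    """Candidate threshold where the M and U error curves are closest.

    The M curve is the share of matched pairs below the threshold (false
    negatives); the U curve is the share of unmatched pairs at or above
    it (false positives). Ties go to the first candidate."""
    def gap(t):
        fn = sum(w < t for w in matched) / len(matched)
        fp = sum(w >= t for w in unmatched) / len(unmatched)
        return abs(fn - fp)
    return min(candidates, key=gap)

matched = [5, 6, 7, 0]            # toy weights for true matched pairs
unmatched = [-5, -4, 2, -3]       # toy weights for true unmatched pairs
threshold = crossing_threshold(matched, unmatched, range(-6, 9))
```

With real weight curves the candidate grid would be much finer, and one may prefer to minimize the sum of the two error rates directly when the curves do not cross cleanly.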

Figure 1. Weight Curves for MEPS 1996 using the SimRate and Training Sample Methods; the dashed vertical reference 
line shows the threshold value of 1. 

Figure 2. Mixture Model Estimates of False Match Rates by Weight, 1996 and 1997 MEPS Training 
Samples (a vertical line is drawn at weight = 1, which is the threshold). 


For linked pair quality, Figure 2 shows the distributions 
of false match rate estimates from mixture modeling. This 
figure shows the model estimated false match rate, the upper 
and lower 95 percent confidence bounds of the error rate 
estimates, and the actual observed rates. Panel 1 shows the 
estimates treating the computer system linked pairs as the 
true matched pairs. Panels 2 and 3 show the estimates from 
the 1996 MEPS and 1997 MEPS training samples. The 
difference between Panels 2 and 3 shows the inconsistency 
of manual selection by different reviewers in our 
application. In all three panels, the 95 percent confidence 
interval of the model estimates failed to cover the true 
observed values. Ideally, one would use both Figure 1 and 
Figure 2 together to guide the choice of selection thresholds. 

We have found SimRate to be an informative and 
flexible tool for determining selection thresholds and 
estimating error rates in our application. Given multinomial 
or other models for the matching variables, the SimRate 
method provides error rate estimates that would be obtained 
from repeated application of the matching algorithm to a 
large number of candidate record pairs. It is also flexible in 
accommodating the choices of comparison sets of pairs for 
computing rates. 

While our application achieved the matching and error 
rate estimation goals for MEPS, more work might be done 
prior to or during the analysis stage. Space does not permit 
us to develop these in the context of the current case study 
but two general approaches might be mentioned. First, it is 
possible to reweight the final results and adjust for false 
nonmatches — treating them in a manner analogous to unit 
nonresponse (e.g., as in Oh and Scheuren 1980). Second, to handle 
mismatches, the ideas in Scheuren and Winkler (1993 and 
1997), and Lahiri and Larsen (2002) might be worth 
consulting. Whether these added steps are needed, of course, 
depends on the final uses to which the linked data will be 
put. 

The Effect of Record Linkage Errors on Risk Estimates in Cohort 
Mortality Studies 


D. Krewski, A. Dewanji, Y. Wang, S. Bartlett, J.M. Zielinski and R. Mallick 


Abstract 


The advent of computerized record linkage methodology has facilitated the conduct of cohort mortality studies in which 
exposure data in one database are electronically linked with mortality data from another database. This, however, introduces 
linkage errors due to mismatching an individual from one database with a different individual from the other database. In 
this article, the impact of linkage errors on estimates of epidemiological indicators of risk such as standardized mortality 
ratios and relative risk regression model parameters is explored. It is shown that the observed and expected numbers of 
deaths are affected in opposite directions and, as a result, these indicators can be subject to bias and additional variability in 
the presence of linkage errors. 


Key Words: Cohort study; Computerized record linkage; Linkage errors; Linkage threshold weight; Poisson 
regression; Relative risk regression; Standardized mortality ratio. 


1. Introduction 


In recent years, a number of historical cohort studies 
have been carried out in environmental epidemiology using 
existing administrative databases as information sources 
(Howe and Spasoff 1986; Carpenter and Fair 1990). In 
general terms, this involves linking records of human 
exposure to environmental hazards with records on health 
status, often using computerized methods for matching 
individual records from different databases. In a cohort 
mortality study, the vital status of each cohort member is 
determined by linkage with mortality records maintained by 
government agencies. Excess mortality within the cohort 
relative to the general population may be due to exposures 
experienced by the cohort members. 

In specific terms, record linkage is the process of 
bringing together two or more separately recorded pieces of 
information pertaining to the same entity (Bartlett, Krewski, 
Wang and Zielinski 1993). Procedures for computerized 
record linkage (CRL) have become highly refined, using 
sophisticated algorithms to evaluate the likelihood of a 
correct match between two records (Hill 1988; Newcombe 
1988). Statistics Canada has developed a CRL system called 
CANLINK which is capable of handling both single file 
linkages and linkages between two separate files (Howe and 
Lindsay 1981; Smith and Silins 1981). In this system, 
weights reflecting the likelihood of a match are attached to 
pairs of records. Two thresholds are set: potential matches 
with linkage weights above the upper threshold are 
considered to be links whereas potential matches with 
weights below the lower threshold are considered to be 
nonlinks. Potential matches with weights between the upper 
and lower thresholds are resolved using additional in- 
formation when available. Otherwise, a single threshold is 
selected to discriminate between links and nonlinks. 

The confidentiality of records protected under the 
Statistics Act is strictly maintained in any study in which 
record linkage is employed. All studies requiring linkage 
with protected databases must satisfy a rigorous review and 
approval process prior to implementation, following well-established 
procedures for data confidentiality (Singh, 
Feder, Dunteman and Yu 2001). All linked files with 
identifying information remain in the custody of Statistics 
Canada (Labossière 1986). 

Computerized record linkage methods have been used to 
link environmental exposure data to the Canadian Mortality 
Data Base (CMDB). For example, a study of Canadian farm 
operators was initiated to investigate possible relationships 
between causes of death in over 326,000 farm operators in 
Canada and various socio-demographic and farming 
variables, particularly pesticide use (Jordan-Simpson, Fair 
and Poliquin 1990). In this study, the CMDB was linked 
with the 1971 Census of Population and the 1971 Census of 
Agriculture. Another ongoing large-scale study is based on 
the National Dose Registry (NDR) of Canada (Ashmore and 
Grogan 1985; Ashmore and Davies 1989). The NDR 
contains information on occupational exposures to ionizing 
radiation experienced by over 400,000 Canadians dating 
back to 1950. The NDR has recently been linked to the 
CMDB to investigate associations between excess mortality 
due to cancer and occupational exposure to low levels of 
ionizing radiation (Ashmore, Krewski and Zielinski 1997; 
Ashmore, Krewski, Zielinski, Jiang, Semenciw and 
Létourneau 1998). More recently, the NDR has been linked 
to the Canadian Cancer Incidence Database (Sont, Zielinski, 
Ashmore, Jiang, Krewski, Fair, Band and Létourneau 2001). 
A comprehensive list of other health studies based on 
linking exposure data with the CMDB has been compiled 
by Fair (1989). 

The success of record linkage studies depends on the 
quality of databases being linked (Roos, Soodeen and 
Jebamani 2001). Using population based longitudinal 
administrative data, Roos et al. examined data quality issues 
in studies of health and health care. Ardal and Ennis (2001) 
considered systematic errors in administrative databases 
involved in secondary analysis of health information. 
Although record linkage studies will benefit from the use of 
high-quality data, limitations in data quality may be offset to 
a certain extent by the large sample sizes found in many 
administrative databases. 

Record linkage studies have several advantages over 
traditional epidemiological studies. By using existing 
administrative databases, the need to collect new data for 
health studies is circumvented, and large sample sizes can 
often be achieved with relatively little effort. Depending on 
the nature of the databases utilized, record linkage provides 
an inexpensive way of exploring many possible associations 
in epidemiological studies. Record linkage also has certain 
disadvantages. There is generally little control over the 
information collected, and there can be appreciable loss to 
follow-up. Another disadvantage of record linkage is the 
occurrence of linkage errors, which is the focus of this 
paper. Inevitably, some records that match will fail to be 
linked, and other nonmatching records will be incorrectly 
linked. 

Relatively little work has been done to determine the 
impact of these linkage errors on statistical inferences. 
Neter, Maynes and Ramanathan (1965) used a simple linear 
regression model to analyze the impact of errors introduced 
during the matching process. Their results indicate that 
linkage errors inflate the residual variance and introduce 
bias into the estimated slope parameter. Winkler and 
Scheuren (1991) derived an expression for the bias in 
estimates of linear regression coefficients due to linkage 
errors. Advances in the estimation of linkage error rates by 
Belin and Rubin (1991) enabled Scheuren and Winkler 
(1993) to implement an improved bias adjustment 
procedure. Linear regression methods for the analysis of 
computer matched data files are further discussed by 
Scheuren and Winkler (1997). 

The purpose of this paper is to explore the impact of 
linkage errors on statistical inferences in cohort mortality 
studies. Relative risk regression models employed in the 
analysis of data from such studies are described in section 2, 
and expressions for the observed and expected numbers of 
deaths based on these models are developed. The impact of 
linkage errors on the observed and expected number of 
deaths and person-years at risk is discussed in section 3. An 
analysis of the impact of linkage errors on estimates of 
standardized mortality ratios (SMRs) and relative risk 
regression parameters is given in section 4. Both types of 
errors can cause bias and additional variability in estimates 
of these parameters. Our conclusions are presented in 
section 5. 


2. Relative Risk Regression Models 


Statistical methods for the analysis of cohort mortality 
studies are well established (Breslow and Day 1987). The 
primary objective of such analysis is to determine if the 
exposure to the agent of interest increases the mortality rate 
among cohort members. Mortality is characterized by the 
hazard function, which specifies the death rate as a function 
of time. Letting $T$ denote the time of death, the hazard 
function at time $u$ is formally defined as 

$$\lambda(u) = \lim_{\Delta u \to 0^{+}} \frac{\Pr\{u \le T < u + \Delta u \mid T \ge u\}}{\Delta u}. \tag{1}$$

Let $\lambda_i(u)$ denote the hazard function for a specific cause of 
death at time $u$ for individual $i = 1, \ldots, N$ in a cohort of 
size $N$, and let $z_i(u)$ represent a corresponding vector of 
covariates specific to that individual. We assume that the 
effect of these covariates is to modify the baseline hazard 
$\lambda^*(u)$ in accordance with the relative risk regression model 

$$\lambda_i(u) = \lambda^*(u)\,\gamma\{\beta' z_i(u)\}, \tag{2}$$

where $\gamma$ is a positive function of the covariates and $\beta$ is a 
vector of regression parameters. 

Two special cases of the general relative risk regression 
model of particular interest are the multiplicative and 
additive risk regression models. Define the function $\gamma$ in (2) 
by 

$$\log \gamma(z) = \frac{(1 + z)^{\rho} - 1}{\rho}. \tag{3}$$

When $\rho = 1$, the general relative risk regression model 
reduces to the multiplicative risk regression model 

$$\lambda_i(u) = \lambda^*(u) \exp\{\beta' z_i(u)\}. \tag{4}$$

This proportional hazards model was introduced by Cox 
(1972), and is widely used in the analysis of mortality data 
(Kalbfleisch and Prentice 1980). The additive risk regression 
model 

$$\lambda_i(u) = \lambda^*(u)\{1 + \beta' z_i(u)\} \tag{5}$$

occurs as a limiting case as $\rho \to 0$. 
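
The two special cases can be checked numerically. The sketch below assumes the power family $\log \gamma(z) = \{(1 + z)^{\rho} - 1\}/\rho$ for the relative risk function, which gives $\exp(z)$ at $\rho = 1$ and $1 + z$ in the limit $\rho \to 0$; the function name and the test value of $z$ are illustrative only.

```python
import math

def relative_risk(z, rho):
    """Relative risk gamma(z) under the assumed power family.

    rho = 1 gives the multiplicative form exp(z); rho = 0 is taken as the
    limiting additive form 1 + z. Requires 1 + z > 0."""
    if rho == 0:
        return 1.0 + z
    return math.exp(((1.0 + z) ** rho - 1.0) / rho)

z = 0.5                                     # illustrative value of beta' z
multiplicative = relative_risk(z, rho=1.0)  # equals exp(0.5)
near_additive = relative_risk(z, rho=1e-8)  # approaches 1 + 0.5
```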

Let $t_i^0$ and $t_i^1$ be the age at the time of entry into the 
study, and the age at the time of loss to follow-up (due to 
withdrawal from the study, termination of the study, or 
death) for the $i$th subject of the cohort, respectively. Let 
$\delta_i = 1$ or 0, according to whether the $i$th individual has or 
has not died at the time of loss to follow-up. The log-likelihood 
function based on the relative risk model (2) may 
be written as 

$$\log L = \sum_{i=1}^{N} \left\{ \delta_i \log\left(\gamma\{\beta' z_i(t_i^1)\}\right) - \int_{t_i^0}^{t_i^1} \gamma\{\beta' z_i(u)\}\,\lambda^*(u)\,du \right\}. \tag{6}$$

When there is a single covariate $z_i(u) = 1$, the maximum 
likelihood estimate of $\theta = \exp\{\beta\}$ reduces to the standardized 
mortality ratio SMR = OBS/EXP, where OBS $= \sum_i \delta_i$ 
and EXP $= \sum_i e_i$ are the observed and expected 
numbers of deaths, respectively, with $e_i = \int_{t_i^0}^{t_i^1} \lambda^*(u)\,du$. 
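
As a small worked example of the ratio OBS/EXP, the sketch below computes an SMR for a toy cohort. It assumes a constant baseline hazard so that each subject's expected count is the rate times the years at risk; the cohort records and the rate of 0.01 deaths per person-year are hypothetical.

```python
def expected_deaths(entry_age, exit_age, baseline_rate):
    """Expected deaths for one subject under a constant baseline hazard."""
    return baseline_rate * (exit_age - entry_age)

def smr(died, expected):
    """Standardized mortality ratio: observed deaths over expected deaths."""
    return sum(died) / sum(expected)

# Each record is (age at entry, age at exit, death indicator).
cohort = [(40, 60, 1), (35, 70, 0), (50, 65, 1), (45, 55, 0)]
expected = [expected_deaths(t0, t1, baseline_rate=0.01) for t0, t1, _ in cohort]
ratio = smr([delta for _, _, delta in cohort], expected)
```

Here two deaths are observed against 0.8 expected, so the ratio is 2.5; with an age-dependent baseline the product would be replaced by an integral or a sum over age bands.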

Maximization of the likelihood function (6) can be 
computationally burdensome with large sample sizes. 
Breslow, Lubin and Langholz (1983) simplify the likelihood 
by assuming that the covariates take on constant values 
within states through which a subject passes during the 
course of the study. The states are defined by cross-classification 
of the covariates of interest. Specifically, 
suppose that there are $J$ such states $\{S_j,\ j = 1, \ldots, J\}$ such 
that $z_i(u) = z_j$ whenever the $i$th subject is in $S_j$ at time $u$. 
These states are mutually exclusive and exhaustive, so that 
at any given time $u$, each member of the cohort will fall 
into one and only one state. The log-likelihood function (6) 
may then be written as 

$$\log L = \sum_{j=1}^{J} \left\{ d_j \log\left(\gamma\{\beta' z_j\}\right) - \gamma\{\beta' z_j\}\, e_j \right\}, \tag{7}$$

where 

$$e_j = \sum_{i=1}^{N} \int_{\{u:\ i \in S_j\}} \lambda^*(u)\,du \tag{8}$$

is the contribution to the expected number of deaths from all 
person-years of observation in the state $S_j$, and $d_j$ 
denotes the total number of deaths in that state. Letting 
$\Lambda_j(\beta) = \log(\gamma\{\beta' z_j\})$, the maximum likelihood estimate 
$\hat{\beta}$ of $\beta$ is obtained as the solution to the score equation 

$$\sum_{j=1}^{J} \frac{\partial \Lambda_j(\beta)}{\partial \beta} \left\{ d_j - \exp\{\Lambda_j(\beta)\}\, e_j \right\} = 0. \tag{9}$$
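
For the multiplicative model with a single scalar covariate, the score equation takes the form $\sum_j z_j \{d_j - \exp(\beta z_j) e_j\} = 0$, which can be solved by Newton-Raphson. The sketch below is an illustration under that restriction only; the function name, starting value, and the two-state data are hypothetical.

```python
import math

def solve_score(z, d, e, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for sum_j z_j * (d_j - exp(beta * z_j) * e_j) = 0.

    Assumes the multiplicative model with one scalar covariate per state."""
    for _ in range(max_iter):
        score = sum(zj * (dj - math.exp(beta * zj) * ej)
                    for zj, dj, ej in zip(z, d, e))
        # Negative derivative of the score (observed information).
        info = sum(zj * zj * math.exp(beta * zj) * ej for zj, ej in zip(z, e))
        step = score / info
        beta += step
        if abs(step) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge")

# Two states: unexposed (z = 0) and exposed (z = 1).
beta_hat = solve_score(z=[0.0, 1.0], d=[10, 30], e=[10.0, 15.0])
```

In this toy case only the exposed state carries information about the parameter, so the estimate is the log of 30/15, that is, log 2.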

3. The Effect of Linkage Errors on the Observed 
and Expected Numbers of Deaths 


Two principal types of errors can occur when linking 
data files in CRL (Fellegi and Sunter 1969). A false positive 
occurs when a member of the cohort who is alive is 
incorrectly identified as dead, and a false negative occurs 
when a deceased member is considered to be alive. More 
specifically, for the mathematical development to follow, a 
false positive occurs in a particular state when an individual 
who remains alive throughout this state is incorrectly 
labelled as dead in this state. Similarly, a false negative 
occurs in a particular state when a member, who died before 
or during the sojourn in this state, is considered to be alive 
throughout this state. Within a particular state, false 
positives and false negatives thus represent special cases of 
misclassification error discussed by Anderson (1974, 
chapter 6.2.1). In this section, we will discuss the effect of 
these two types of linkage errors on the observed and 
expected numbers of deaths, respectively. To do this, we 
first define sets of indices within states which will be used to 
represent sets of correctly matched and incorrectly matched 
records. 


3.1 Linkage Errors 


Let $A_j$ and $D_j$ denote the set of labels for those individuals 
in the cohort who remain alive throughout state $S_j$ 
and those who are dead in $S_j$, respectively. Write $D_{jj}$ as 
the subset of $D_j$ corresponding to those individuals who 
have died in $S_j$. Let $A_j^{\ell}$, $D_j^{\ell}$ and $D_{jj}^{\ell}$ denote the corresponding 
sets in the presence of linkage errors. We further 
define $D_j^P$ as the set of labels of those alive in $S_j$ (that is, 
in $A_j$) but labeled as dead in $S_j$, corresponding to the false 
positives in $S_j$. Similarly, $A_j^N$ is the set of those dead in 
$S_j$ (that is, in $D_j$) but labeled as alive in $S_j$, corresponding 
to the false negatives in $S_j$. Let us also write $D_{jj}^P$ as the 
subset of $D_j^P$ corresponding to those who are labeled to 
have died in $S_j$ and, similarly, $A_{jj}^N$ as the subset of $A_j^N$ 
who have died in $S_j$ (that is, in $D_{jj}$). These sets satisfy the 
relations $A_j^{\ell} = (A_j - D_j^P) \cup A_j^N$, $D_j^{\ell} = (D_j - A_j^N) \cup D_j^P$ and, likewise, $D_{jj}^{\ell} = (D_{jj} - A_{jj}^N) \cup D_{jj}^P$. 


The effect of linkage errors on the likelihood function in 
(7) may be described as follows. Let $t_{ij}^0$ denote the time at 
which the $i$th individual enters, actually or by linkage error, 
the $j$th state $S_j$. Similarly, $t_{ij}^d$ denotes the time of death (if 
it occurs, actually or by linkage error) for the $i$th individual 
in $S_j$ and $t_{ij}^1$ the time of leaving $S_j$, actually or by linkage 
error. Note that, if $t_{ij}^d$ exists, it is less than or equal to $t_{ij}^1$. 
Let us, for the sake of simplicity, assume that $t_{ij}^d$, if it exists, is 
equal to $t_{ij}^0$; that is, all the deaths in a state occur at the 
corresponding entry times in that state. Although this will 
underestimate the expected number of deaths, for the 
purpose of studying bias, it may not be that objectionable. 
Assuming all the deaths to occur at the times of leaving the 
corresponding states also offers similar simplification. 
Using (8) and the decomposition of $A_j^{\ell}$, the expected 
number of deaths $e_j^{\ell}$ in $S_j$ in the presence of linkage errors 
can be written as 

$$
\begin{aligned}
e_j^{\ell} &= \sum_{i \in A_j^{\ell}} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du \\
&= \sum_{i \in A_j} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du + \sum_{i \in A_j^N} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du - \sum_{i \in D_j^P} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du \\
&= e_j - \Delta e_j,
\end{aligned} \tag{10}
$$

where 

$$e_j = \sum_{i \in A_j} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du, \quad \text{and} \quad \Delta e_j = e_j^P - e_j^N, \tag{11}$$

with 

$$e_j^P = \sum_{i \in D_j^P} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du \quad \text{and} \quad e_j^N = \sum_{i \in A_j^N} \int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du. \tag{12}$$

For notational convenience, let us write $T_{\lambda}(i, j)$ for 
$\int_{t_{ij}^0}^{t_{ij}^1} \lambda^*(u)\,du$ in what follows. The term $\Delta e_j$ represents 
the bias in the expected number of deaths in the $j$th state 
due to linkage errors. It follows from (10) and (11) that the 
false positives tend to reduce the expected number of deaths 
and the false negatives tend to increase the expected number 
of deaths. 

Using the decomposition for $D_{jj}^{\ell}$, the observed number 
of deaths $d_{jj}^{\ell}$ in the presence of linkage errors may be 
written as 

$$d_{jj}^{\ell} = d_{jj} + \Delta d_j, \tag{13}$$

where 

$$\Delta d_j = d_{jj}^P - a_{jj}^N, \tag{14}$$

with $d_{jj}$, $d_{jj}^P$ and $a_{jj}^N$ denoting the number of individuals in 
the sets $D_{jj}$, $D_{jj}^P$ and $A_{jj}^N$, respectively. The term $\Delta d_j$ 
represents the difference in the observed number of 
deaths in the $j$th state due to linkage errors. It follows from 
(13) and (14) that the false positives will increase the 
observed number of deaths and the false negatives will 
reduce the observed number of deaths. 

Vital status is often determined by linkage with the 
CMDB, which is generally much larger than the cohort of 
interest. When the exposure records of a live individual are 
incorrectly associated with those of a dead person, the 
deceased individual usually does not belong to the cohort. 
Thus, the person-years at risk contributed by the person 
remaining alive will end prematurely in the year of 
presumed death; the lost person-years at risk correspond to 
the time period from the year of presumed death until the 
end of the follow-up. On the other hand, when the exposure 
records of a dead individual are incorrectly associated with 
those of a live person, the person-years at risk contributed 
by this individual will include an extra period from the 
actual death-year to the end of the follow-up. Thus, false 
positives will deflate the number of person-years at risk and 
false negatives will inflate the number of person-years at 
risk in the cohort. 
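
A simple numerical illustration of the person-year effect follows. The helper function, the calendar years, and the follow-up window are hypothetical and serve only to show the direction of the change for each error type.

```python
def person_years(entry_year, end_of_followup, death_year=None):
    """Years at risk from entry until death or the end of follow-up."""
    stop = end_of_followup if death_year is None else min(death_year, end_of_followup)
    return max(stop - entry_year, 0)

# False positive: a living member is wrongly linked to a 1990 death record.
true_alive = person_years(1970, 2000)                     # full follow-up
false_positive = person_years(1970, 2000, death_year=1990)
lost_years = true_alive - false_positive

# False negative: a member who died in 1985 is treated as alive to 2000.
true_dead = person_years(1970, 2000, death_year=1985)
false_negative = person_years(1970, 2000)
extra_years = false_negative - true_dead
```

In this example the false positive removes 10 person-years and the false negative adds 15, matching the directions described above.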


3.2 Expectations and Variances of Differences 
Between the Observed and Expected Numbers 
of Deaths 


The effect of linkage errors on the observed and expected 
numbers of deaths depends on the false positive and false 
negative rates. Let $p_j^P$ and $p_j^N$ denote the false positive 
and false negative rates, respectively, in $S_j$, for $j = 1, \ldots, J$, 
which are assumed to be constant within $S_j$ and the same 
for all the individuals in $A_j$ and $D_j$, respectively. This assumption 
is reasonable whenever individuals in the same 
state are highly homogeneous, particularly with respect to 
attributes such as the quality of personal identifiers that 
influence linkage error rates. Although this idealized assumption 
is unlikely to be fully satisfied in practice, it 
affords considerable simplification in the subsequent evaluation 
of the effects of linkage errors. Formally, $p_j^P$ ($p_j^N$) is 
the conditional probability that an individual in $A_j$ ($D_j$) is 
labeled dead (alive) in $S_j$. That is, $p_j^P = P[i \in D_j^P \mid i \in A_j]$ 
and $p_j^N = P[i \in A_j^N \mid i \in D_j]$. 

Let us write $a_j$, $d_j$, $a_j^N$ and $d_j^P$ as the number of 
individuals in $A_j$, $D_j$, $A_j^N$ and $D_j^P$, respectively. Then, 
note that $d_j^P$ follows a Binomial($a_j$, $p_j^P$) distribution and 
$a_j^N$ follows a Binomial($d_j$, $p_j^N$) distribution. Also, $d_{jj}^P$ 
follows a Binomial($a_j$, $p_{jj}^P$) distribution, where $p_{jj}^P$ is the 
conditional probability that an individual in $A_j$ is labeled to 
have died in $S_j$. That is, $p_{jj}^P = P[i \in D_{jj}^P \mid i \in A_j]$. Clearly, 
$p_{jj}^P \le p_j^P$. Similarly, $a_{jj}^N$ follows a Binomial($d_{jj}$, $p_{jj}^N$) 
distribution, where $p_{jj}^N$ is the conditional probability that an 
individual in $D_{jj}$ is labeled as alive in $S_j$. That is, $p_{jj}^N = 
P[i \in A_{jj}^N \mid i \in D_{jj}]$. Although there is no trivial relationship 
between $p_j^N$ and $p_{jj}^N$ in general, it is reasonable to assume 
$p_j^N = p_{jj}^N$ in this context of linkage errors. 

Assuming that linkage errors related to different 
individuals are independent, the expectation and variance of 
the difference in the observed number of deaths in S,, 
given by Ad, in (14), are 


E{Ad ,)J=E[di|-Ela}l=a, pi—-d, pj’ (15) 
and 
V [Ad ,]=V[d2]+V[a¥] 
=a, pi (l-pi)+d, pi’d-p). (16) 
Since A j and D ji consist of different sets of individuals, 
d!; and a‘ are independent. 
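The moments in (15) and (16) are those of a difference between two independent binomial counts: falsely added deaths among the individuals who are alive, minus missed deaths among the individuals who died. The following Monte Carlo sketch checks this numerically. The function names, the cohort sizes and the two error probabilities are hypothetical and serve only as an illustration.

```python
import random


def theoretical_moments(n_alive, n_dead, p_false_death, p_missed_death):
    """Mean and variance of (falsely added deaths - missed deaths)."""
    mean = n_alive * p_false_death - n_dead * p_missed_death
    variance = (n_alive * p_false_death * (1 - p_false_death)
                + n_dead * p_missed_death * (1 - p_missed_death))
    return mean, variance


def simulate_difference(n_alive, n_dead, p_false_death, p_missed_death, rng):
    """One draw of the difference in the observed number of deaths."""
    false_deaths = sum(rng.random() < p_false_death for _ in range(n_alive))
    missed_deaths = sum(rng.random() < p_missed_death for _ in range(n_dead))
    return false_deaths - missed_deaths


rng = random.Random(2005)          # fixed seed so the sketch is reproducible
n_alive, n_dead = 500, 40          # hypothetical sizes of the two groups
p_fp, p_fn = 0.02, 0.10            # hypothetical error probabilities

draws = [simulate_difference(n_alive, n_dead, p_fp, p_fn, rng) for _ in range(2000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
mean, variance = theoretical_moments(n_alive, n_dead, p_fp, p_fn)

print("theoretical mean and variance:", mean, variance)
print("simulated mean and variance:", sample_mean, sample_var)
```

Because the two counts are drawn from disjoint groups of individuals, their variances simply add, which is what the simulation reproduces up to sampling noise.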

Similarly, the expectation and variance of the difference 
in the expected number of deaths in S,, given by Ae, in 
(11), can be calculated as follows. For this purpose, it is 
convenient to write e? and ej’ in terms of the following 
indicator variables. For ic A;, define ,,=J{ie D?} and 
Gi; =/ {ie DE}. Also, for ie D,, define y,, =I {ie AN}. 
Then, from (12) and the definitions of D? and Ay , we 
have 


$$e_j^P = \sum_{i \in A_j} I\{i \in D_j^P\}\, T(i, j) \qquad (17)$$
and
$$e_j^N = \sum_{i \in D_j} I\{i \in A_j^N\}\, T(i, j). \qquad (18)$$


In particular, one can write d a =Diea, Sy and at = 
Dae pd, Wii which are useful to derive (15) and (16). From 
(17) and (18), we have 


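$$E[\Delta e_j] = E[e_j^P] - E[e_j^N] = p_j^P \sum_{i \in A_j} T(i, j) - p_j^N \sum_{i \in D_j} T(i, j), \qquad (19)$$
and
$$V[\Delta e_j] = V[e_j^P] + V[e_j^N] = p_j^P (1 - p_j^P) \sum_{i \in A_j} T^2(i, j) + p_j^N (1 - p_j^N) \sum_{i \in D_j} T^2(i, j), \qquad (20)$$
since $A_j$ and $D_j$ consist of different sets of individuals. 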

The results (15)—(16) and (19)—(20) indicate that 
record linkage errors will lead to bias and additional 
variation in the observed and expected number of deaths. 
Minimizing the variance terms in (16) and (20) is difficult 
since the two error rates $p_j^P$ and $p_j^N$ are not functionally 
independent. Generally, decreasing $p_j^P$ will result in an 
increase in $p_j^N$ and vice versa (see section 5 for further 
discussion of this point). Although these error rates are 
independent of the underlying relative risk regression model 
y in (2), the mean square error obtained by combining the 
expectation and variance terms cannot be minimized 
without specification of the baseline hazard A*(u), which 
appears in T,. 


4. The Effect of Linkage Errors on Estimates of 
SMRs and Regression Coefficients 


4.1 Standardized Mortality Ratios 


To determine the effect of linkage errors on the SMR, we 
replace the actual observed and expected numbers of deaths 
$d_j$ and $e_j$ by the observed and expected numbers of deaths 
$d_j^L$ and $e_j^L$ in the presence of linkage errors in the 
expression $\mathrm{SMR} = \sum_j d_j / \sum_j e_j$. Letting $\mathrm{SMR}_L$ denote the 
standardized mortality ratio in the presence of linkage 
errors, we have 

$$\mathrm{SMR}_L = \mathrm{SMR}\, \frac{1 + \sum_j \Delta d_j / \sum_j d_j}{1 - \sum_j \Delta e_j / \sum_j e_j}. \qquad (21)$$


It follows, from (10)—(14), that the false positives will 
increase the SMR, whereas the false negatives will decrease 
the SMR. 
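The relation between the SMR with and without linkage errors can be checked numerically. The sketch below assumes, in line with the score equation (34), that linkage errors change the observed deaths from $d_j$ to $d_j + \Delta d_j$ and the expected deaths from $e_j$ to $e_j - \Delta e_j$. All function names and numbers are hypothetical.

```python
def smr(deaths, expected):
    """Standardized mortality ratio: total observed over total expected deaths."""
    return sum(deaths) / sum(expected)


def smr_with_errors(deaths, expected, delta_d, delta_e):
    """SMR recomputed directly from the error-affected counts."""
    observed = [d + dd for d, dd in zip(deaths, delta_d)]
    expect = [e - de for e, de in zip(expected, delta_e)]
    return sum(observed) / sum(expect)


def smr_via_ratio_form(deaths, expected, delta_d, delta_e):
    """Same quantity written as SMR times a ratio of relative changes."""
    numerator = 1 + sum(delta_d) / sum(deaths)
    denominator = 1 - sum(delta_e) / sum(expected)
    return smr(deaths, expected) * numerator / denominator


def relative_difference_first_order(deaths, expected, delta_d, delta_e):
    """First order approximation of (SMR with errors - SMR) / SMR."""
    return sum(delta_d) / sum(deaths) + sum(delta_e) / sum(expected)


deaths = [10, 20, 30]              # hypothetical observed deaths by state
expected = [12.0, 18.0, 25.0]      # hypothetical expected deaths by state
delta_d = [1, 2, -1]               # hypothetical net change in observed deaths
delta_e = [0.5, 0.3, -0.2]         # hypothetical net loss in expected deaths

direct = smr_with_errors(deaths, expected, delta_d, delta_e)
ratio_form = smr_via_ratio_form(deaths, expected, delta_d, delta_e)
exact_relative = direct / smr(deaths, expected) - 1
approx_relative = relative_difference_first_order(deaths, expected, delta_d, delta_e)

print("SMR without errors:", smr(deaths, expected))
print("SMR with errors (direct, ratio form):", direct, ratio_form)
print("relative difference (exact, first order):", exact_relative, approx_relative)
```

The direct computation and the ratio form agree exactly, while the first order expression is only an approximation whose quality depends on the relative changes being small.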

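By using a first order Taylor series approximation of 
$\mathrm{SMR}_L$ about SMR, the difference $\Delta\mathrm{SMR} = \mathrm{SMR}_L - \mathrm{SMR}$ 
can be expressed as 

$$\frac{\Delta\mathrm{SMR}}{\mathrm{SMR}} \approx \frac{\sum_j \Delta d_j}{\sum_j d_j} + \frac{\sum_j \Delta e_j}{\sum_j e_j}. \qquad (22)$$

Then, the mean and variance of the relative difference in the 
SMR can be approximated by 

$$E\left[\frac{\Delta\mathrm{SMR}}{\mathrm{SMR}}\right] \approx \frac{\sum_j E[\Delta d_j]}{\sum_j d_j} + \frac{\sum_j E[\Delta e_j]}{\sum_j e_j} \qquad (23)$$

and 

$$V\left[\frac{\Delta\mathrm{SMR}}{\mathrm{SMR}}\right] \approx \Big(\sum_j d_j\Big)^{-2} V\Big[\sum_j \Delta d_j\Big] + \Big(\sum_j e_j\Big)^{-2} V\Big[\sum_j \Delta e_j\Big] + 2 \Big(\sum_j d_j\Big)^{-1} \Big(\sum_j e_j\Big)^{-1} \mathrm{Cov}\Big[\sum_j \Delta d_j, \sum_j \Delta e_j\Big], \qquad (24)$$

respectively. The right hand side of (23) can be easily 
calculated by using (15) and (19). In order to calculate the 
right hand side of (24), note that 

$$V\Big[\sum_j \Delta d_j\Big] = \sum_j V[\Delta d_j] + 2 \sum_{j < j'} \mathrm{Cov}[\Delta d_j, \Delta d_{j'}], \qquad (25)$$

$$V\Big[\sum_j \Delta e_j\Big] = \sum_j V[\Delta e_j] + 2 \sum_{j < j'} \mathrm{Cov}[\Delta e_j, \Delta e_{j'}], \qquad (26)$$

and 

$$\mathrm{Cov}\Big[\sum_j \Delta d_j, \sum_j \Delta e_j\Big] = \sum_j \mathrm{Cov}[\Delta d_j, \Delta e_j] + \sum_{j \neq j'} \mathrm{Cov}[\Delta d_j, \Delta e_{j'}]. \qquad (27)$$
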
Without loss of generality, let us assume, for $j < j'$, that 
the entry time in $S_j$ of the same individual $i$ (alive or dead) 
is the same as or earlier than that in $S_{j'}$. We then have, 
for $j < j'$, 

[Equations (28)-(32): covariance expressions, including $\mathrm{Cov}[\Delta d_j, \Delta d_{j'}]$ in (28) and $\mathrm{Cov}[\Delta e_j, \Delta e_{j'}]$; not recoverable from the source.]

Using (25)-(32), the variance of the relative difference 
ASMR/SMR can be approximated by the right hand side 
of (24). Two conclusions can be drawn from (23) and (24). 
First, linkage errors can lead to bias in the estimate of the 
SMR. Second, both types of linkage errors introduce 
additional variation into estimates of the SMR. Note that the 
first term in (32) is dominated by the first term in (29) for 
py < 0.5, and the negative covariance term (28) is 
dominated in the calculation of the variance in (25). 
Therefore, the additional variance (24) is strictly positive, 
since both the false positive and false negative rates are 

positive. 


4.2 Relative Risk Regression Parameters 


To determine the effect of linkage errors on regression 
parameter estimates, consider first the general relative risk 
regression model (2). Replacing the observed and expected 
numbers of deaths d , and e, in the log-likelihood function 
(7) with the observed and expected numbers of deaths in the 
presence of linkage errors di and e+, we have 


logL= > {di -log(y{B z;})—y{B z,}ef}. G3) 
Let 8 and 6 denote the maximum likelihood estimates of 
B based on {d,, e,} and {d4, e+}, respectively. The 
score equation (9) can be written as 


[d, +Ad, —exp{A,(B)}(e, -Ae,)]=0. (34) 


Assuming that AB = 8-8 is small, a first order expansion 
of exp{A ,(8)} around B gives 


NaN 

ap > 
where ee anh. j(B) and OA , ;/0B is JA ,/OB evaluated at 
B=f. Substituting (35) into (34) leads to 


exp{A ,(B)} = 


(35) 


A 


ONG A 
2 TH [d,, —exp{A ,}e,] 
l= 


Ad, + B'z,}Ae, 


+7{P'z,JAe AB 


e) he 
es 3B 
Using (9), the first summation in (36) is zero. Consequently, 
since Ae AB is small, AB may be approximated by 


ape[3 any eo 
. >B Oe oe, op 
It follows from (37) that 


etapl=[5 ay 


B’z}Ae,}. (37) 
where a, = E[Ad,]+y{B’z,} E[Ae,], 
calculated from (15) and (19). Further, 
© ,=Cov[Ad , + y{B'z, }Ae,, Ad +y{B'z , Ae, I, 


which can also be easily obtained using (16), (20) and 
(28) —(32). 

In the special case of the multiplicative risk model (4), 
the difference AB due to linkage errors may be 
approximated by 


AB=(X’WX)'X(AD+AW), (40) 


where X"= (Zi, nT eA Die Adi Weed AW 
diag(exp(z; B) e,, ... ,exp(z,B)e,), and  AW’= 
(exp(z; B) Deine cAERD( By B)Ae,). Note that the weight 
matrix W is the Fisher information matrix for B. It follows 
from (38) that 


EUAD|=(X WX) x (41) 


where II’=(n,, ..., 0,) with =, being same as o,, but 


v{B’z.} ;} teplaced ee exp(Z, B). 
Further, 


WA Bile OW eX VAX OVX (XX ics (42) 


where ‘Y is the matrix of © ,,’s with y{B’ z ,} replaced 
by exp(Z, B). Note that (40)-(42) are special cases of 
(37)-(39), respectively, written in matrix notation. 

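With a single covariate $z_j = 1$, $X'WX = e^{\hat\beta} \sum_j e_j$, 
$X'\Delta D = \sum_j \Delta d_j$ and $X'\Delta W = e^{\hat\beta} \sum_j \Delta e_j$. In this case, 

$$\Delta\beta = \frac{\sum_j \Delta d_j + e^{\hat\beta} \sum_j \Delta e_j}{e^{\hat\beta} \sum_j e_j}. \qquad (43)$$

Since the $\mathrm{SMR} = e^{\hat\beta} = \sum_j d_j / \sum_j e_j$, with $\Delta\beta = 
\Delta\mathrm{SMR}/\mathrm{SMR}$ in this case, we have 

$$\Delta\beta = \frac{\sum_j \Delta d_j}{\sum_j d_j} + \frac{\sum_j \Delta e_j}{\sum_j e_j}. \qquad (44)$$

Thus, (44) may be viewed as a special case of (22). 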
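The reduction from (43) to (44) in the single-covariate case can be verified with a few lines of arithmetic. This is only an illustrative sketch: the function names and the numbers are hypothetical, and the exact change in the log-SMR is included merely to show how close the first order expression is.

```python
import math


def delta_beta_matrix_form(deaths, expected, delta_d, delta_e):
    """Single-covariate version of the matrix expression, with exp(beta) = SMR."""
    exp_beta = sum(deaths) / sum(expected)
    return (sum(delta_d) + exp_beta * sum(delta_e)) / (exp_beta * sum(expected))


def delta_beta_smr_form(deaths, expected, delta_d, delta_e):
    """Same quantity written as the relative difference in the SMR."""
    return sum(delta_d) / sum(deaths) + sum(delta_e) / sum(expected)


def delta_beta_exact(deaths, expected, delta_d, delta_e):
    """Exact change in log(SMR) when deaths gain delta_d and expected lose delta_e."""
    with_errors = (sum(deaths) + sum(delta_d)) / (sum(expected) - sum(delta_e))
    return math.log(with_errors) - math.log(sum(deaths) / sum(expected))


deaths = [10, 20, 30]              # hypothetical observed deaths by state
expected = [12.0, 18.0, 25.0]      # hypothetical expected deaths by state
delta_d = [1, 2, -1]               # hypothetical changes in observed deaths
delta_e = [0.5, 0.3, -0.2]         # hypothetical losses in expected deaths

matrix_form = delta_beta_matrix_form(deaths, expected, delta_d, delta_e)
smr_form = delta_beta_smr_form(deaths, expected, delta_d, delta_e)
exact = delta_beta_exact(deaths, expected, delta_d, delta_e)

print("matrix form:", matrix_form)
print("SMR form:", smr_form)
print("exact change in log SMR:", exact)
```

The first two values coincide algebraically once $e^{\hat\beta}$ is replaced by the ratio of total observed to total expected deaths; the third differs only by higher order terms.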

The preceding results indicate that both false positives 
and false negatives will introduce bias and additional 
variation into the estimates of relative risk regression 
parameters. The only negative contribution to this additional 
variance (39) is through Cov[Ad ,, Ad], given by 
(28), and the first term in (32) (see 0, ,). Using the same 
argument as in section 4.1, it follows that this additional 
variance is strictly positive. 


5. Conclusions 


Record linkage is now a well-established technique in 
epidemiological studies of population health risks. By 
linking information on individual exposures from one 
database to information on health outcomes in another 
database, it is possible to construct large-scale informative 
databases on risks to health of populations and population 
subgroups. The success of such studies will depend to a 
large extent on the quality of the two databases being linked, 
including the amount of information on individual 
identifiers used to link individuals in the two databases. In 
most studies, the accuracy of the linkage is examined by 
estimating the false link (false positive) and false nonlink 
(false negative) rates associated with the linkage process. In 
practice, this is usually done by drawing a sample of linked 
and nonlinked records, and determining the accuracy of the 
linkages in the sample using auxiliary information drawn 
from other sources. 

Although CRL has been used for some time in cohort 
mortality studies, the impact of linkage errors on the 
reliability of statistical inferences drawn from such studies 
has not been subjected to detailed investigation. The 
theoretical results presented in this paper address this 
issue. These results show that in addition to inflating the 
observed number of deaths, false positives will tend to 
deflate the expected number of deaths. Conversely, false 
negatives inflate the expected numbers of deaths and deflate 
the observed number of deaths. Linkage errors were shown 
to introduce bias into estimates of SMRs. Relative risk 
regression coefficients are also subject to bias, the direction 
of which depends on the nature of the regression coefficient. 
In addition to these biases, linkage errors introduce 
additional uncertainty into estimates of both SMRs and 
regression coefficients. 

Although we make the simplifying assumption of 
ti = ty one can derive the relevant expressions for bias and 
excess variability without this assumption; however, the 
expressions are too complex to offer additional insight into 
the effects of linkage errors. This is also true of the 
assumption that p* =p’. There is a technical issue with 
the definition of A, for the state(s) corresponding to the last 
age interval, which is usually open up to © on the right 
hand side. In such state(s), the assumption that t} =? will 
be problematic if the probability of dying in this last interval 
is appreciable. This problem may be circumvented by 
assuming the human life span to have a finite upper limit. 

As discussed at the end of section 3.1, false positives 
occur primarily when an individual who is alive at the end 
of the follow-up period is incorrectly linked with a dead 
person. However, a person who died in one of the states S, 
may be falsely linked with another person with an earlier 
death time. This leads to a false positive which persists until 
the actual time of death; the analysis in section 3 allows for 
this type of error. Similarly, a dead person may be falsely 
linked with another person dying at a later time, who is not 
alive at the end of follow-up. This case is treated as a false 
negative only up to the false death time. At this false time of 
death, this will contribute incorrectly to the number of 
deaths, an error which has not been considered in section 3. 
However, this type of error would not normally be detected 
in typical record linkage studies in which a simplified 
manual check is used to identify false positives and false 
negatives. Since this type of error is likely to be rare, the 
effect is expected to be small. 

In order to further explore the potential impact of linkage 
errors, let $\tau_j$ be the upper age limit for the $j$-th state $S_j$. 
(Note that some of the $\tau_j$'s may be equal.) Then, letting $\alpha$ 
denote the probability of a linkage error (of either type), the 
false positive and negative rates, $p_j^P$ and $p_j^N$, may be 
written as $\alpha P[T < \tau_j]$ and $\alpha P[T > \tau_j]$, respectively. In 
particular, Pi =OP[T ,_; <a ST, ], where T 51 is the 
lower age limit for the j" state, and pj} = p/’. Therefore, 
the false positive rates may be greater than the false negative 
rates in the older age groups, with the reverse happening in 
the younger age groups. Assuming a similar pattern in the 
size of the D j *sand A j ’s, some cancellation of terms may 
take place in the calculation of E[Ae,] in (19) and 
E[Ad,,| in (15). This cancellation effect will reduce the 
expected bias in the SMR and the relative risk regression 
parameters given in (23) and (38), respectively. 
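The age pattern described above can be sketched numerically. The example assumes that the false positive and false negative rates take the form of a common linkage error probability multiplied by the probability that the age at death T falls below, respectively above, the upper age limit of the state. The Weibull survival function, the error probability of 0.05 and the age limits are hypothetical choices made only for illustration.

```python
import math

ALPHA = 0.05  # hypothetical probability of a linkage error of either type


def survival(age):
    """Hypothetical Weibull survival function P[T > age]; not from the paper."""
    return math.exp(-((age / 80.0) ** 6))


def error_rates(upper_age_limit, alpha=ALPHA):
    """Return (false positive rate, false negative rate) for one state."""
    p_false_positive = alpha * (1.0 - survival(upper_age_limit))
    p_false_negative = alpha * survival(upper_age_limit)
    return p_false_positive, p_false_negative


rates = {tau: error_rates(tau) for tau in (40, 60, 80, 90)}
for tau, (fp, fn) in rates.items():
    print(f"upper age limit {tau}: false positive {fp:.4f}, false negative {fn:.4f}")
```

Under these assumptions the false negative rate dominates in the younger states and the false positive rate dominates in the older states, which is the pattern that allows some cancellation of terms in the expected bias.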

Although we have considered only all-cause mortality in 
this article, cause-specific mortality can be examined by 
simple modifications of the definitions of D iP Di and 
Di. These sets should then consider only those deaths 
from the specific cause of interest. Consequently, d,, and 
e, should denote, respectively, the observed and expected 
number of deaths of the specific type in S,;. The hazard 
function in (1) and (2) should relate to the specific type of 
death, with A*(u) being the corresponding baseline cause- 
specific hazard rate. Finally, the indicator 5, in section 2 
should indicate the specific type of death. 

While the preceding analytical results shed considerable 
light on the effects of linkage errors in cohort mortality 
studies, it is important to investigate such effects under 
conditions as close as possible to those that may be encountered in 
practice. To this end, we conducted a computer simulation 
study based on actual data from the National Dose Registry 
of Canada, in which the introduction of false links and false 
nonlinks with known probabilities has been used to further 
evaluate the impact of linkage errors on estimates of cancer 
risk (Mallick, Krewski, Dewanji and Zielinski 2002). These 
simulation results corroborate the theoretical findings of this 
paper. 

While the results reported here may help to clarify the 
impact of linkage errors on statistical inference, methods 
that take such errors into account in the statistical analyses 
remain to be developed. Such methods may be based on 
response error models employed in survey sampling, used in 
conjunction with traditional statistical methods for analyses 
of cohort mortality data. Research in this area is underway. 


Statistics Canada, Catalogue No. 12-001-XPB 


6. Acknowledgements 


This research was supported in part by a grant from the 
Natural Sciences and Engineering Research Council of 
Canada to D. Krewski, who currently holds the 
NSERC/SSHRC/McLaughlin Chair in Population Health 
Risk Assessment at the University of Ottawa. Preliminary 
versions of this paper were presented at the Annual Joint 
Meeting of the American Statistical Association in San 
Francisco, August 8-12, 1993, and the Annual Meeting of 
the Statistical Society of Canada, Montreal, July 10-16, 
1995. The final draft was presented in the session in honour 
of J.N.K. Rao at the Statistics Canada Symposium 2001 
held in Ottawa on October 18, 2001. The first author 
(D. Krewski) is particularly grateful to have been invited to 
speak in the session in honour of J.N.K. Rao, who served as 
his doctoral thesis supervisor many years ago. This work 
was completed while A. Dewanji was a Visiting Scholar at 
the McLaughlin Centre for Population Health Risk 
Assessment in the summer of 2002 and 2003. 


References 


Anderson, T.W. (1974). An Introduction to Multivariate Statistical 
Analysis. New York: John Wiley & Sons, Inc. 


Ardal, S., and Ennis, S. (2001). Data detectives: Uncovering 
systematic errors in administrative databases. Proceedings: 
Symposium 2001, Achieving Data Quality in a Statistical Agency: 
A Methodological Perspective, Statistics Canada, Ottawa. 


Ashmore, J.-P., and Grogan, D. (1985). The national dose registry of 
Canada. Radiation Protection Dosimetry, 11, 95-100. 


Ashmore, J.-P., and Davies, B.D. (1989). The national dose registry: 
A centralized record keeping system for radiation workers in 
Canada. In Applications of Computer Technology to Radiation 
Protection, IAEA-SR-136/58, J. Stefan Institute, Ljubljana, 505-520. 


Ashmore, J.-P., Krewski, D. and Zielinski, J.M. (1997). Protocol for a 
cohort mortality study of occupational radiation exposure based on 
the national dose registry of Canada. European Journal of Cancer, 
33, S10-S21. 


Ashmore, J.-P., Krewski, D., Zielinski, J.M., Jiang, H., Semenciw, R. 
and Létourneau, E. (1998). First analysis of occupational radiation 
mortality based on the national dose registry of Canada. American 
Journal of Epidemiology, 148, 564-574. 


Bartlett, S., Krewski, D., Wang, Y. and Zielinski, J.M. (1993). 
Evaluation of error rates in large scale computerized record 
linkage studies. Survey Methodology, 19, 3-12. 


Belin, T.R., and Rubin, D.B. (1991). Recent developments in 
calibrating error rates for computer matching. Proceedings of the 
1991 Annual Research Conference, U.S. Bureau of the Census, 
657-668. 


Breslow, N.E., Lubin, J.H. and Langholz, B. (1983). Multiplicative 
models and cohort analysis. Journal of the American Statistical 
Association, 78, 1-12. 




Breslow, N.E., and Day, N.E. (1987). Statistical Methods in Cancer 
Research, Vol. 2: The Design and Analysis of Cohort Studies. 
IARC scientific publication No. 82, international agency for 
research on cancer, Lyon, France. 


Carpenter, M., and Fair, M.E. (Eds.) (1990). Canadian Epidemiology 
Research Conference — 1989: Proceedings of Record Linkage 
Sessions & Workshop. Ottawa Select Printing, Ottawa. 


Cox, D.R. (1972). Regression models and life tables (with 
discussion). Journal of Royal Statistical Society, B, 34, 187-220. 


Fair, M.E. (1989). Studies and References Relating to Uses of the 
Canadian Mortality Data Base. Report from the occupational and 
environmental health research unit, Health Division, Statistics 
Canada, Ottawa. 


Fellegi, I., and Sunter, A. (1969). A theory for record linkage. Journal 
of the American Statistical Association, 64, 1183-1210. 


Hill, T. (1988). Generalized Iterative Record Linkage System: GIRLS 
Strategy (Release 2.7). Report from research and general system, 
informatics services and development division, Statistics Canada, 
Ottawa. 


Howe, G.R., and Lindsay, J. (1981). A generalized iterative record 
linkage computer system for use in medical follow-up studies. 
Computers and Biomedical Research, 14, 327-340. 


Howe, G.R., and Spasoff, R.A. (Eds.) (1986). Proceeding of the 
Workshop on Computerized Linkage in Health Research. 
University of Toronto Press, Toronto. 


Jordan-Simpson, D.A., Fair, M.E. and Poliquin, C. (1990). Canadian 
farm operator study: Methodology. Health Reports, 2, 141-155. 


Kalbfleisch, J.D., and Prentice, R.L. (1980). The Statistical Analysis of 
Failure Time Data. New York: John Wiley & Sons, Inc. 


Labossiére, G. (1986). Confidentiality and access to data: The 
practice at Statistics Canada. Proceedings of the Workshop on 
Computerized Record Linkage in Health Research, University of 
Toronto Press, Toronto. 


Mallick, R., Krewski, D., Dewanji, A. and Zielinski, J.M. (2002). A 
simulation study of the effect of record linkage errors in cohort 
mortality data. Proceedings of International Conference in Recent 
Advances in Survey Sampling. Carleton University, Ottawa, to 
appear. 




Neter, J., Maynes, E.S. and Ramanathan, R. (1965). The effect of 
mismatching on the measurement of response errors. Journal of 
the American Statistical Association, 60, 1005-1027. 


Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for 
Health and Statistical Studies, Administration, and Business. 
Oxford Medical Publications. Oxford. 


Roos, L.L., Soodeen, R. and Jebamani, L. (2001). An information- 
rich environment: Linked-record systems and data quality in 
Canada. Proceedings: Symposium 2001, Achieving Data Quality 
in a Statistical Agency: A Methodological Perspective, Statistics 
Canada, Ottawa. 


Scheuren, F., and Winkler, W.E. (1993). Regression analysis of data 
files that are computer matched. Survey Methodology, 19, 39-58. 


Scheuren, F., and Winkler, W.E. (1997). Regression analysis of data 
files that are computer matched—Part II. Survey Methodology, 23, 
157-165. 


Singh, A.C., Feder, M., Dunteman, G. and Yu, F. (2001). Protecting 
confidentiality while preserving quality of public use micro data. 
Proceedings: Symposium 2001, Achieving Data Quality in a 
Statistical Agency: A Methodological Perspective. Statistics 
Canada, Ottawa. 


Smith, M.E., and Silins, J. (1981). Generalized iterative record 
linkage system. Social Statistics Section, Proceedings of the 
American Statistical Association, 128-137. 


Sont, W.N., Zielinski, J.M., Ashmore, J.P., Jiang, H., Krewski, D., 
Fair, M.E., Band, P. and Létourneau, E. (2001). First analysis of 
cancer incidence and occupational radiation exposure based on the 
national dose registry of Canada. American Journal of 
Epidemiology, 153, 309-318. 


Winkler, W.E., and Scheuren, F. (1991). How computer matching 
error effect regression analysis: Exploratory and confirmatory 
analysis. Technical report, Statistical research division, U.S. 
Bureau of the Census, Washington, D.C. 






Survey Methodology, June 2005 
Vol. 31, No. 1, pp. 23-40 
Statistics Canada, Catalogue No. 12-001-XPB 




Analysis of Experiments Embedded in Complex Sampling Designs 


Jan A. van den Brakel and Robbert H. Renssen ¹ 


Abstract 


At national statistical institutes, experiments embedded in ongoing sample surveys are conducted occasionally to investigate 
possible effects of alternative survey methodologies on estimates of finite population parameters. To test hypotheses about 
differences between sample estimates due to alternative survey implementations, a design-based theory is developed for the 
analysis of completely randomized designs or randomized block designs embedded in general complex sampling designs. 
For both experimental designs, design-based Wald statistics are derived for the Horvitz-Thompson estimator and the 
generalized regression estimator. The theory is illustrated with a simulation study. 


Key Words: Design-based analysis; Measurement error models; Probability sampling; Randomized experiments; Superimposition. 


1. Introduction 


A part of survey methodology is to consider and test 
alternative survey methods, to improve the quality and 
efficiency of sample survey processes at national statistical 
institutes. Large-scale field experiments embedded in 
ongoing surveys are particularly appropriate to quantify the 
effect of alternative survey implementations on response 
behavior or estimates of finite population parameters. At 
Statistics Netherlands, for example, the effects of alternative 
questionnaire designs, different approach strategies or 
different advance letters have been investigated on both 
kinds of parameters, see Van den Brakel and Renssen 
(1998), Van den Brakel (2001), and Van den Brakel and 
Van Berkel (2002). At national statistical institutes, sample 
surveys are generally kept unchanged as long as possible in 
order to construct uninterrupted time series of estimates of 
population parameters. It is inevitable, however, that survey 
processes are adjusted from time to time. Embedded 
experiments can be applied to detect and quantify possible 
trend disruptions in these time series due to necessary 
changes to a sample survey and provide a safe transition 
from an old to a new survey design. Running the old and 
new surveys concurrently by means of an embedded 
experiment creates the possibility of falling back on the old 
approach for regular publication purposes if the new 
approach turns out to be a failure. 

Applications of embedded experiments in the literature 
are aimed at the estimation of the bias or the various 
variance components in total measurement error models. 
Mahalanobis (1946) introduced the idea of embedding 
experiments in ongoing sample surveys, probably for the 
first time, as interpenetrating subsampling to test interviewer 
differences under simple random sampling and unrestricted 
randomization of sampling units to interviewers. Fellegi 
(1964) and Hartley and Rao (1978) generalized this 
approach to estimate response variances under more 
complex sampling designs and restricted randomization of 
sampling units. Fienberg and Tanur (1987, 1988, 1989) 
discuss the differences and parallels between the theory of 
experimental designs and finite population sampling and 
how the statistical methodology employed in both fields can 
be combined in a useful and natural way in the design and 
analysis of embedded experiments. In their 1988 article, 
they give a comprehensive overview of applications of 
embedded experiments mentioned in the literature. 

The typical situation considered in this paper is a field 
experiment designed to compare the effect of K different 
survey implementations, i.e., the treatments, on the main 
estimates of the finite population parameters of a current 
survey. To this end, a probability sample that is drawn from 
a finite target population is randomly divided into K 
subsamples according to an experimental design. Each sub- 
sample is assigned to one of the K treatments. The experi- 
mental designs considered in this paper are completely 
randomized designs (CRD’s) and randomized block designs 
(RBD’s) where sampling structures like strata, primary 
sampling units (PSU’s), clusters or interviewers are 
potential block variables. Generally one large subsample is 
assigned to the regular survey, which will be used for 
official publication purposes and which will simultaneously 
serve as the control group in the experiment. The purpose of 
embedded experiments is the estimation of finite population 
parameters under the different survey implementations and 
to test hypotheses about the differences between estimates 
of those parameters. 
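The random division of a sample into K subsamples can be sketched as follows for both designs. This is a minimal illustration, assuming that subsample sizes are fixed in advance for the CRD and that treatment fractions are applied within each block for the RBD; the function names, unit identifiers and fractions are hypothetical.

```python
import random
from collections import defaultdict


def assign_crd(unit_ids, subsample_sizes, rng):
    """Completely randomized design: shuffle all units, then cut into K subsamples."""
    if sum(subsample_sizes) != len(unit_ids):
        raise ValueError("subsample sizes must add up to the sample size")
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    assignment, start = {}, 0
    for treatment, size in enumerate(subsample_sizes):
        for unit in shuffled[start:start + size]:
            assignment[unit] = treatment
        start += size
    return assignment


def assign_rbd(unit_blocks, fractions, rng):
    """Randomized block design: randomize separately within each block.

    unit_blocks maps a unit id to its block label (e.g. stratum or interviewer).
    """
    blocks = defaultdict(list)
    for unit, block in unit_blocks.items():
        blocks[block].append(unit)
    assignment = {}
    for block in sorted(blocks):
        units = blocks[block]
        sizes = [int(round(f * len(units))) for f in fractions[:-1]]
        sizes.append(len(units) - sum(sizes))  # last treatment takes the remainder
        if min(sizes) < 0:
            raise ValueError("fractions do not fit the block size")
        assignment.update(assign_crd(units, sizes, rng))
    return assignment


# Hypothetical sample of 24 units in 3 blocks of 8; treatment 0 is the control
# (regular survey) and receives half of the sample.
rng = random.Random(1)
sample_blocks = {unit: unit // 8 for unit in range(24)}
crd_assignment = assign_crd(list(sample_blocks), [12, 6, 6], rng)
rbd_assignment = assign_rbd(sample_blocks, [0.5, 0.25, 0.25], rng)
print(crd_assignment)
print(rbd_assignment)
```

In the RBD sketch every block contributes units to every treatment, so that differences between blocks do not contribute to the comparison between treatments.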

At first instance, a standard model-based approach might 
be considered for this analysis. Since experimental units are 
drawn by means of a complex sampling design without 
replacement from a finite population, the application of such 
an approach might result in design-biased parameter and 
variance estimates. This makes the analysis results incom- 
mensurate with the parameter and variance estimates of the 
ongoing survey, which complicates the interpretation of the 
results in the design-based setting of the sample survey. To 
make the analysis more robust to departures from the 
assumed model, a design-based analysis that accounts for 
the sampling design should be applied. 

Before we present our design-based approach two 
alternatives are mentioned that, at first glance, seem to be 
correct. We briefly argue, however, that both alternatives 
generally give invalid results. The first alternative is to apply 
a design-based linear regression analysis that accounts for 
the sampling design to estimate and test hypotheses about 
the K treatment effects in the regression model. This 
approach easily results, however, in wrong design variances, 
since the randomization of the experimental design is 
ignored. The main analysis objective of embedded experi- 
ments is to compare the effect of alternative survey 
approaches on the main estimates of the current sample 
survey. A linear regression analysis doesn’t precisely meet 
this objective, since the treatment effects in the regression 
model are generally not equal to the differences between the 
subsample estimates. 

The second alternative is to apply a design-based 
inference for comparing domain parameters, in which the K 
treatments are considered as K domains. The objective of an 
embedded experiment, however, is to compare estimates of 
the same parameter under different survey strategies or 
treatments, whereas in the case of domain parameters the 
objective is to compare estimates of different population 
parameters under basically the same survey strategy. 

The approach presented in this paper can be summarized 
as follows. Based on the K subsamples, a design-based 
estimator for the population parameter observed under each 
of the K treatments, and a design-based estimator for the 
covariance matrix of the K—1 contrasts between these 
estimates are derived. This estimation procedure accounts 
for the probability structure of the sampling design, the 
random assignment of sampling units to treatments due to 
the experimental design, and the weighting procedure 
applied in the ongoing survey for the estimation of target 
parameters. This gives rise to a design-based Wald statistic 
to test the stated hypotheses about differences between 
sample survey estimates. 
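A generic Wald statistic for the K-1 contrasts can be sketched as follows. The example is a simplified illustration and not the estimator derived in this paper: it assumes that the K subsample estimates and their variance estimates are already available, that the contrasts compare each treatment with the control (index 0), and that the covariance matrix of the contrasts can be built from the K variance estimates alone. All names and numbers are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def wald_statistic(estimates, variances):
    """Wald statistic for the hypothesis that all K subsample estimates are equal."""
    k = len(estimates)
    contrasts = [estimates[j] - estimates[0] for j in range(1, k)]
    # Covariance of the contrasts: the control variance enters every element,
    # and each treatment variance is added on the diagonal only.
    covariance = [
        [variances[0] + (variances[i] if i == j else 0.0) for j in range(1, k)]
        for i in range(1, k)
    ]
    weighted = solve_linear_system(covariance, contrasts)
    return sum(c * w for c, w in zip(contrasts, weighted))


# Hypothetical subsample estimates (control first) and variance estimates.
w_three = wald_statistic([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
w_two = wald_statistic([10.0, 12.0], [1.0, 3.0])
print(w_three, w_two)
```

With two treatments the statistic reduces to the squared difference between the two estimates divided by the sum of their variance estimates.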

The main contribution of this paper is to provide a 
general framework for comparing K alternative survey 
approaches in the realistic situation of a full-scale sample 
survey process. The random selection of sampling units 
from a finite target population by means of a probability 
sample is used in combination with randomization of the 
sampling units over different treatments according to an 
experimental design. This facilitates comparison of alter- 
native survey implementations on the main outcomes of a 
sample survey and the generalization of the observed results 
to populations larger than the sample included in the 
experiment. The analysis procedure proposed in this paper 
generalizes the analysis of two-treatment experiments 
embedded in sample surveys (Van den Brakel and Renssen 
(1998) and Van den Brakel and Van Berkel (2002)) to 
CRD’s and RBD’s with K >2 treatments. An important 
result is that the design-based estimator for the covariance 
matrix of the contrasts between the subsample estimates has 
a relatively simple structure, as if the sampling units were 
drawn with replacement and unequal selection probabilities. 
As a result neither joint inclusion probabilities nor design- 
covariances between the subsample estimates are required 
in the variance estimation procedure. This results in an 
attractive and relatively simple analysis procedure. A 
second advantage is that this procedure tests hypotheses on 
differences between the sample estimates of the survey, 
which facilitates the interpretation of the analysis results in 
many applications. 

A design-based theory for the analysis of embedded 
experiments is presented in section 2. In section 3 it is 
explained in more detail why the design-based linear 
regression analysis is less appropriate. In section 4, the 
proposed design-based analysis procedure is evaluated in a 
simulation study. Conclusions are summarized in section 5. 


2. Analysis of Embedded Experiments 


2.1 Measurement Error Models 


Although the analysis procedure for embedded experi- 
ments proposed in this section is design-based, some use is 
made of measurement error models. Testing systematic 
effects of different survey methodologies on the outcomes 
of a survey implies the existence of measurement errors. 
The traditional notion that observations obtained from 
sampling units are true fixed values observed without error, 
generally assumed in design-based sampling theory, is not 
tenable in such situations. Therefore a measurement error 
model is specified for the observations obtained under the 
different survey implementations or treatments of the 
experiment. This model links the treatment effects to 
systematic differences between finite population parameters. 

Consider a finite population $U$ of $N$ individuals. Let
variable $y_{ikl}$ denote the potential response of the $i$th
individual ($i = 1, 2, \ldots, N$) observed by means of the $k$th
treatment ($k = 1, 2, \ldots, K$) and the $l$th interviewer
($l = 1, 2, \ldots, L$). It is assumed that these observations are a
realization of the measurement error model
$y_{ikl} = u_i + \beta_k + \gamma_l + \varepsilon_{ik}$. Here $u_i$ is the true, intrinsic value of the
$i$th individual, $\beta_k$ the effect of the $k$th treatment, $\gamma_l$ the
effect of the $l$th interviewer on the $i$th individual and $\varepsilon_{ik}$ an
error component of the $i$th individual observed by means of
the $k$th treatment. The interviewer effect $\gamma_l$ allows for
systematic clustering and correlation between the responses
of the individuals assigned to the same interviewer due to
fixed and random interviewer effects, i.e., $\gamma_l = \psi_l + \xi_l$,
with $\psi_l$ the fixed and $\xi_l$ the random effect of the $l$th
interviewer. Besides interviewers, common factors such as
coders and supervisors might also induce correlation
between the responses of the individuals.

Since for each sampling unit a potential response variable 
is defined for each of the K different treatments, the 
measurement error model can be expressed in matrix 
notation as 


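$$\mathbf{y}_{il} = \mathbf{j}u_i + \boldsymbol{\beta} + \mathbf{j}\gamma_l + \boldsymbol{\varepsilon}_i, \tag{1}$$

where $\mathbf{y}_{il} = (y_{i1l}, \ldots, y_{iKl})'$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_K)'$,
$\boldsymbol{\varepsilon}_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iK})'$ and $\mathbf{j}$ is a $K$-vector with each
element equal to one. Let $E_m$ and $\mathrm{Cov}_m$
denote the expectation and the covariance with respect to
the measurement error model. The following model
assumptions are made:

$$E_m(\boldsymbol{\varepsilon}_i) = \mathbf{0}, \tag{2}$$

$$\mathrm{Cov}_m(\boldsymbol{\varepsilon}_i, \boldsymbol{\varepsilon}_{i'}') = \begin{cases} \boldsymbol{\Sigma}_i & \text{if } i = i' \\ \mathbf{O} & \text{if } i \neq i', \end{cases} \tag{3}$$

$$E_m(\xi_l) = 0, \tag{4}$$

$$\mathrm{Cov}_m(\xi_l, \xi_{l'}) = \begin{cases} \tau_l^2 & \text{if } l = l' \\ 0 & \text{if } l \neq l', \end{cases} \tag{5}$$

$$\mathrm{Cov}_m(\varepsilon_{ik}, \xi_l) = 0, \tag{6}$$

where $\mathbf{0}$ is a vector of order $K$ with each element zero and $\mathbf{O}$
a matrix of order $K \times K$ with each element zero. If $\psi_l = 0$,
then a model with only random interviewer effects is
obtained. If $\tau_l^2 = 0$, then a model with only fixed
interviewer effects is obtained. From the assumptions, it follows
that

$$E_m(\mathbf{y}_{il}) = \mathbf{j}u_i + \mathbf{j}\psi_l + \boldsymbol{\beta}, \tag{7}$$

and

$$\mathrm{Cov}_m(\mathbf{y}_{il}, \mathbf{y}_{i'l'}') = \begin{cases} \boldsymbol{\Sigma}_i + \mathbf{j}\mathbf{j}'\tau_l^2 & : i = i' \text{ and } l = l' \\ \mathbf{j}\mathbf{j}'\tau_l^2 & : i \neq i' \text{ and } l = l' \\ \mathbf{O} & : i \neq i' \text{ and } l \neq l'. \end{cases} \tag{8}$$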


Any correlation between the responses of different indi- 
viduals can be modeled by means of random interviewer 
effects. Any fixed interviewer effects influence the expected
response values. From now on, for notational convenience,
the subscript $l$ will be omitted in $y_{ikl}$ and $\mathbf{y}_{il}$.


2.2 Hypotheses Testing 


The measurement error model for the observations 
obtained in the experiment enables us to relate systematic 
differences between population parameters to the different 
survey implementations. Suppose that L interviewers are 
available for the data collection. The population U of size N 
can conceptually be divided into $L$ groups $U_l$ of size
$N_l$, $l = 1, \ldots, L$, such that all individuals within a group are
potentially interviewed by the same interviewer. Let
$\bar{\mathbf{Y}} = (\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_K)'$ denote the $K$ dimensional vector of
population means of $\mathbf{y}_i$, i.e.,


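$$\bar{\mathbf{Y}} = \mathbf{j}\frac{1}{N}\sum_{i=1}^{N} u_i + \boldsymbol{\beta} + \mathbf{j}\sum_{l=1}^{L}\frac{N_l}{N}\psi_l + \mathbf{j}\sum_{l=1}^{L}\frac{N_l}{N}\xi_l + \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{\varepsilon}_i. \tag{9}$$
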
The objective of the experiment is to investigate whether 
there are systematic differences between the K population 
means of Y due to the K different survey strategies or 
treatments. This can be accomplished by formulating
hypotheses about

$$E_m(\bar{\mathbf{Y}}) = \mathbf{j}\frac{1}{N}\sum_{i=1}^{N} u_i + \mathbf{j}\sum_{l=1}^{L}\frac{N_l}{N}\psi_l + \boldsymbol{\beta}, \tag{10}$$


where the expectation is taken over the measurement error 
model. This gives rise to the following hypothesis: 


$$H_0: \mathbf{C}E_m\bar{\mathbf{Y}} = \mathbf{0}, \qquad H_1: \mathbf{C}E_m\bar{\mathbf{Y}} \neq \mathbf{0}, \tag{11}$$


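where $\mathbf{C}$ denotes a $(K-1) \times K$ matrix with $K-1$ contrasts
and $\mathbf{0}$ a $K-1$ vector of zeros. Since $\mathbf{C}\mathbf{j} = \mathbf{0}$, it follows that
$\mathbf{C}E_m\bar{\mathbf{Y}} = \mathbf{C}\boldsymbol{\beta}$ and hypothesis (11) concerns the treatment
effects as represented by $\boldsymbol{\beta}$ in the measurement error model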
(1). The contrasts between the population parameters neatly 
correspond to these treatment effects. For the randomized 
experiments considered in this paper, it holds that each 
experimental unit assigned to an interviewer / has a nonzero 
probability of being assigned to each of the K treatments. 
Therefore, the bias in the parameter estimates due to fixed 
interviewer effects is the same under each of the K 
treatments and cancels out in the $K-1$ contrasts between
the K parameter estimates. 

Hypothesis (11) will be tested by estimating $E_m\bar{\mathbf{Y}}$
instead of $\boldsymbol{\beta}$, taking into account the sampling design, the
experimental design, and the weighting procedure of the 
ongoing survey applied for the estimation of population 
parameters. To test (11), a probability sample drawn from a 
finite population is available. The sampling units 
(experimental units) are randomized over K subsamples and 
are assigned to one of the $K$ treatments. In section 2.3 a
design-unbiased estimator for $E_m\bar{\mathbf{Y}}$, denoted $\hat{\bar{\mathbf{Y}}}$, is derived.
For example $\hat{\bar{\mathbf{Y}}}$ may be the Horvitz-Thompson estimator or
the generalized regression estimator. Let $\mathbf{V}$ denote the
covariance matrix of $\hat{\bar{\mathbf{Y}}}$. An (approximately)
design-unbiased estimator for the covariance matrix of the $K-1$
contrasts of $\hat{\bar{\mathbf{Y}}}$, denoted $\mathbf{C}\hat{\mathbf{V}}\mathbf{C}'$, will be derived in section
2.4. Now, hypothesis (11) can be tested by means of the
following design-based Wald statistic:

$$W = \hat{\bar{\mathbf{Y}}}'\mathbf{C}'(\mathbf{C}\hat{\mathbf{V}}\mathbf{C}')^{-1}\mathbf{C}\hat{\bar{\mathbf{Y}}}. \tag{12}$$


For mathematical convenience, we prefer the contrast
matrix $\mathbf{C} = (\mathbf{j} \mid -\mathbf{I})$, where $\mathbf{j}$ is a $K-1$ vector of ones and $\mathbf{I}$
the $(K-1) \times (K-1)$ identity matrix.
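
As a small illustration of this contrast matrix, the following sketch builds $\mathbf{C} = (\mathbf{j} \mid -\mathbf{I})$ for $K = 3$ and checks numerically that $\mathbf{C}\mathbf{j} = \mathbf{0}$, so that any level common to all treatments drops out of the contrasts. The treatment effects and the common level are hypothetical values chosen for the example, and the helper names are not part of the paper.

```python
def contrast_matrix(K):
    """Build C = (j | -I): row r contrasts treatment 1 with treatment r + 2."""
    return [[1.0] + [-1.0 if col == row else 0.0 for col in range(K - 1)]
            for row in range(K - 1)]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


K = 3
C = contrast_matrix(K)
beta = [0.0, 0.4, -0.2]   # hypothetical treatment effects
level = 5.0               # hypothetical level shared by all treatments
expected_means = [level + b for b in beta]

c_times_j = matvec(C, [1.0] * K)            # C j, a vector of zeros
c_times_means = matvec(C, expected_means)   # contrasts of the expected means
c_times_beta = matvec(C, beta)              # contrasts of the treatment effects
print(c_times_j, c_times_means, c_times_beta)
```

Up to floating-point rounding, the last two vectors coincide, which is the numerical counterpart of $\mathbf{C}E_m\bar{\mathbf{Y}} = \mathbf{C}\boldsymbol{\beta}$.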


2.3 Estimation of Treatment Effects 


2.3.1 Horvitz-Thompson Estimator 


Consider a sample $s$ drawn by a generally complex
sampling design, that can be described by the first and
second order inclusion probabilities $\pi_i$ and $\pi_{ii'}$ of the $i$th
and $i, i'$th sampling unit(s) respectively. In the case of a
CRD, sample $s$ is randomly divided into $K$ subsamples $s_k$
of size $n_k$. If $n_+ = \sum_{k=1}^{K} n_k$ denotes the number of sampling
units in $s$, then the conditional probability that the $i$th
sampling unit is selected in subsample $s_k$, given that
sample $s$ is selected, is equal to $n_k/n_+$. In the case of an
RBD the sampling units are, conditionally on the
realization of $s$, deterministically divided into $J$ blocks $s_j$.
Potential block variables are sampling structures like strata,
clusters, PSU's, interviewers and the like. Within each
block, the sampling units are randomized over the $K$
treatments. Let $n_{jk}$ denote the number of sampling units in
block $j$ assigned to treatment $k$. Then $n_{j+} = \sum_{k=1}^{K} n_{jk}$
denotes the size of block $j$, $n_{+k} = \sum_{j=1}^{J} n_{jk}$ denotes the size
of subsample $s_k$ and $n_{++} = \sum_{k=1}^{K}\sum_{j=1}^{J} n_{jk}$ denotes the size
of sample $s$. The conditional probability that the $i$th
sampling unit is selected in subsample $s_k$, given that
sample $s$ is selected and $i \in s_j$, is equal to $n_{jk}/n_{j+}$.

Each subsample $s_k$ can be considered as a two-phase
sample, where the first order inclusion probabilities of the
first phase sample are obtained from the sampling design
and the conditional first order inclusion probabilities of the
second phase sample are obtained from the experimental
design. From this point of view, the first order inclusion
probabilities for the elements of $s_k$ are equal to
$\pi_i^* = (n_k/n_+)\pi_i$ for CRD's and $\pi_i^* = (n_{jk}/n_{j+})\pi_i$ for RBD's if
this $i$th sampling unit is assigned to the $j$th block. It
follows that the Horvitz-Thompson estimator for $\bar{Y}_k$ based
on the $n_{+k}$ observations obtained from subsample $s_k$ can
be defined as:

$$\hat{\bar{Y}}_{k;HT} = \frac{1}{N}\sum_{i=1}^{n_{+k}}\frac{y_{ik}}{\pi_i^*} = \frac{1}{N}\sum_{i=1}^{n_{++}}\frac{\mathbf{p}_{ik}'\mathbf{y}_i}{\pi_i}, \tag{13}$$

where $\mathbf{p}_{ik}$ are $K$-vectors that describe the randomization
mechanism of the experimental design. For a CRD, it
follows that

$$\mathbf{p}_{ik} = \begin{cases} \dfrac{n_+}{n_k}\mathbf{r}_k & \text{if } i \in s_k \\ \mathbf{0} & \text{if } i \notin s_k, \end{cases} \tag{14}$$

and for an RBD

$$\mathbf{p}_{ik} = \begin{cases} \dfrac{n_{j+}}{n_{jk}}\mathbf{r}_k & \text{if } i \in s_{jk} \\ \mathbf{0} & \text{if } i \notin s_{jk}, \end{cases} \tag{15}$$

where $\mathbf{r}_k$ denotes the unit vector of order $K$ with the $k$th
element equal to one and the other elements equal to zero
and $\mathbf{0}$ denotes a $K$ vector of zeros. Properties of the vectors
$\mathbf{p}_{ik}$ are given in the appendix.
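
The two-phase view of the subsamples can be sketched in a few lines of code. The example below assumes a CRD and computes the Horvitz-Thompson estimates of the $K$ treatment means with the adjusted inclusion probabilities $\pi_i^* = (n_k/n_+)\pi_i$. The function name, argument layout and toy data are illustrative assumptions, not part of the paper.

```python
def ht_treatment_means(y, pi, treatment, K, N):
    """Horvitz-Thompson estimates of the K treatment means for a CRD.

    y[i]         observed response of sampling unit i under its own treatment
    pi[i]        first order inclusion probability of unit i in sample s
    treatment[i] treatment index (0..K-1) assigned to unit i
    N            size of the finite population
    """
    n_plus = len(y)
    n_k = [treatment.count(k) for k in range(K)]
    means = [0.0] * K
    for y_i, pi_i, k in zip(y, pi, treatment):
        pi_star = (n_k[k] / n_plus) * pi_i   # two-phase inclusion probability
        means[k] += y_i / (N * pi_star)
    return means


# Toy data: simple random sample of n = 6 from N = 60, two treatments.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pi = [0.1] * 6
treatment = [0, 1, 0, 1, 0, 1]
estimates = ht_treatment_means(y, pi, treatment, K=2, N=60)
print(estimates)
```

With equal inclusion probabilities $\pi_i = n_+/N$, each estimate reduces to the plain mean of the responses in the corresponding subsample.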

Now, since $s_k$ can be considered as a two-phase sample
it holds that $E_sE_e(\hat{\bar{Y}}_{k;HT} \mid s, m) = \bar{Y}_k$, where $E_s$ and $E_e$
denote the expectation with respect to the sample design and
the experimental design, respectively. So, given $m$, the
vector $\hat{\bar{\mathbf{Y}}}_{HT} = (\hat{\bar{Y}}_{1;HT}, \ldots, \hat{\bar{Y}}_{K;HT})'$ is proposed as a
design-unbiased estimator for $\bar{\mathbf{Y}}$. But then, $\hat{\bar{\mathbf{Y}}}_{HT}$ is unbiased for
$E_m\bar{\mathbf{Y}}$.


2.3.2 The Generalized Regression Estimator 


In finite population sampling it is customary to increase 
the accuracy of the Horvitz-Thompson estimator, if suitable 
auxiliary information is available, by means of the
generalized regression estimator, see e.g., Bethlehem and Keller
(1987) and Särndal, Swensson and Wretman (1992). The
generalized regression estimator enables us to incorporate 
the weighting scheme of the ongoing survey in the analysis 
of embedded experiments. This might decrease the design 
variance as well as the bias due to selective nonresponse and 
therefore it may increase the accuracy of the experiment. In 
the present context the generalized regression estimator 
therefore represents a design-based analogue of covariance 
analysis in standard experimental design methodology. 

Besides the values of the response variable $y_i$, we also
associate with each unit in the population an $H$-vector $\mathbf{x}_i$
of auxiliary information. The finite population means of
these auxiliary variables are assumed to be known and are
denoted by $\bar{\mathbf{X}}$. It is also assumed that the auxiliary variables
are intrinsic values, that can be observed without
measurement errors, and so are not affected by the treatments. When
the model assisted approach of Särndal et al. (1992) is
followed, the intrinsic values $u_i$ in the measurement error
model of section 2.1 for each unit in the population are
assumed to be an independent realization of the following
linear regression model:

$$u_i = \mathbf{b}'\mathbf{x}_i + e_i, \tag{16}$$

where $\mathbf{b}$ is an $H$-vector containing the regression
coefficients and the $e_i$ are the residuals. In the model
assisted approach of Särndal et al. (1992), the intrinsic
values $u_i$ are considered to be a realization of an underlying
superpopulation model defined by (16). In this case the
residuals $e_i$ are independent random variables with a
variance $\omega_i^2$. Then it is required that all $\omega_i^2$ are known up to
a common scale factor; that is $\omega_i^2 = v_i\omega^2$ with $v_i$ known.
From a strictly design-based point of view, proposed by
Bethlehem and Keller (1987), there is no need to adopt a
superpopulation model. In that case the residuals are fixed
intrinsic values of the elements in the finite population and
no model assumptions about the residuals are needed. In this
paper, the model assisted approach of Särndal is adopted.
This implies that expectations with respect to the
measurement model, as in (7) and (10), are conditional on the
realization of the intrinsic values $u_i$, $i = 1, \ldots, N$, in the
finite population according to the superpopulation model
(16).

The regression coefficients of the linear model (16) in the 
finite population are defined as 


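$$\mathbf{b} = \left(\sum_{i=1}^{N}\frac{\mathbf{x}_i\mathbf{x}_i'}{\omega_i^2}\right)^{-1}\sum_{i=1}^{N}\frac{\mathbf{x}_i u_i}{\omega_i^2}. \tag{17}$$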


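The intrinsic values $u_i$ are not observable due to
measurement errors and treatment effects. Consequently,
(17) cannot be computed, even in the case of a complete
enumeration of the finite population. In the case of a
complete enumeration under the $k$th treatment

$$\tilde{\mathbf{b}}_k = \left(\sum_{i=1}^{N}\frac{\mathbf{x}_i\mathbf{x}_i'}{\omega_i^2}\right)^{-1}\sum_{i=1}^{N}\frac{\mathbf{x}_i y_{ik}}{\omega_i^2}, \quad k = 1, 2, \ldots, K, \tag{18}$$

denotes the finite population regression coefficients of the
linear model (16). Conditional on the realization of
$u_i$, $i = 1, \ldots, N$, the expectation of the finite population
regression coefficients $\tilde{\mathbf{b}}_k$ with respect to the measurement
error model is given by

$$E_m\tilde{\mathbf{b}}_k = \left(\sum_{i=1}^{N}\frac{\mathbf{x}_i\mathbf{x}_i'}{\omega_i^2}\right)^{-1}\sum_{i=1}^{N}\frac{\mathbf{x}_i(u_i + \beta_k + \psi_l)}{\omega_i^2} \equiv \mathbf{b}_k, \quad k = 1, 2, \ldots, K, \tag{19}$$

where $\psi_l$ is the fixed effect of the interviewer to whom unit $i$ is assigned.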


The finite population regression coefficients $\tilde{\mathbf{b}}_k$ and $\mathbf{b}_k$
can be estimated using the sample data from subsample $s_k$,
with the Horvitz-Thompson estimator:
Now the generalized regression estimator for $\bar{Y}_k$, based on
the $n_{+k}$ observations of subsample $s_k$, is defined as

$$\hat{\bar{Y}}_{k;greg} = \hat{\bar{Y}}_{k;HT} + \hat{\mathbf{b}}_k'(\bar{\mathbf{X}} - \hat{\bar{\mathbf{X}}}_{k;HT}), \quad k = 1, 2, \ldots, K, \tag{20}$$

where $\hat{\bar{\mathbf{X}}}_{k;HT}$ denotes the Horvitz-Thompson estimator for
the population means of the auxiliary variables $\bar{\mathbf{X}}$ based on
the $n_{+k}$ sampling units of subsample $s_k$.
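
To make the generalized regression estimator concrete, here is a minimal sketch for one subsample with an intercept and a single auxiliary variable, so $\mathbf{x}_i = (1, z_i)'$, and with $\omega_i^2 = 1$. The regression coefficients are estimated by weighted least squares with weights $1/\pi_i^*$, which is the usual Horvitz-Thompson type form and is an assumption here, since the expression for $\hat{\mathbf{b}}_k$ is not reproduced above. The function name and the toy data are likewise hypothetical.

```python
def greg_mean(y, z, pi_star, z_bar, N):
    """GREG estimate of one treatment mean from one subsample.

    y, z     responses and auxiliary values of the units in subsample s_k
    pi_star  two-phase inclusion probabilities of these units
    z_bar    known population mean of the auxiliary variable
    N        population size (the known mean of the intercept is 1)
    """
    w = [1.0 / p for p in pi_star]
    # Weighted normal equations for x_i = (1, z_i)', omega_i^2 = 1 assumed.
    a11 = sum(w)
    a12 = sum(wi * zi for wi, zi in zip(w, z))
    a22 = sum(wi * zi * zi for wi, zi in zip(w, z))
    c1 = sum(wi * yi for wi, yi in zip(w, y))
    c2 = sum(wi * zi * yi for wi, zi, yi in zip(w, z, y))
    det = a11 * a22 - a12 * a12
    b0 = (a22 * c1 - a12 * c2) / det
    b1 = (a11 * c2 - a12 * c1) / det
    # Horvitz-Thompson estimates of the means of y, the intercept and z.
    y_ht = c1 / N
    one_ht = a11 / N
    z_ht = a12 / N
    return y_ht + b0 * (1.0 - one_ht) + b1 * (z_bar - z_ht)


# Toy subsample in which y is exactly linear in z: y = 2 + 3 z.
z = [1.0, 2.0, 4.0, 7.0]
y = [2.0 + 3.0 * zi for zi in z]
pi_star = [0.1, 0.2, 0.1, 0.05]
estimate = greg_mean(y, z, pi_star, z_bar=3.0, N=50)
print(estimate)
```

Because the toy responses are an exact linear function of the auxiliary variable, the estimate reproduces $2 + 3\bar{z}$ regardless of the weights, which is a convenient sanity check of the calibration property.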

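When expressing (20) as a function of $(\hat{\bar{Y}}_{k;HT}, \hat{\mathbf{b}}_k, \hat{\bar{\mathbf{X}}}_{k;HT})$,
the generalized regression estimator can be approximated by
means of a first order Taylor linearization about
$(E_m\bar{Y}_k, \mathbf{b}_k, \bar{\mathbf{X}})$, where $\mathbf{b}_k$ is defined in (19). This gives:

$$\hat{\bar{Y}}_{k;greg} \approx \hat{\bar{Y}}_{k;HT} + \mathbf{b}_k'(\bar{\mathbf{X}} - \hat{\bar{\mathbf{X}}}_{k;HT}) = \mathbf{b}_k'\bar{\mathbf{X}} + \hat{\bar{E}}_{k;HT}, \quad k = 1, 2, \ldots, K,$$

with

$$\hat{\bar{E}}_{k;HT} = \hat{\bar{Y}}_{k;HT} - \mathbf{b}_k'\hat{\bar{\mathbf{X}}}_{k;HT} = \frac{1}{N}\sum_{i=1}^{n_{++}}\frac{\mathbf{p}_{ik}'(\mathbf{y}_i - \mathbf{B}'\mathbf{x}_i)}{\pi_i},$$

and where $\mathbf{B}$ is an $H \times K$ matrix of which the columns are
the $H$-vectors $\mathbf{b}_k$. Now $\hat{\bar{\mathbf{Y}}}_{greg} = (\hat{\bar{Y}}_{1;greg}, \ldots, \hat{\bar{Y}}_{K;greg})'$ is
proposed as an approximately design-unbiased estimator for
$E_m\bar{\mathbf{Y}}$.
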
2.4 Variance Estimation of Treatment Effects 


Let $\mathbf{V}$ denote the covariance matrix of $\hat{\bar{\mathbf{Y}}}_{greg}$. To
estimate the covariance terms of $\mathbf{V}$, vectors $\mathbf{y}_i$ containing
the observations of all $K$ treatments obtained from each
sampling unit are required. Since in the experimental
designs under consideration each sampling unit is assigned
to one of the $K$ treatments, only one of the components of
$\mathbf{y}_i$, for $i \in s$, is actually observed. Consequently, a
design-unbiased estimator for $\mathbf{V}$ cannot be derived. Van den Brakel
and Binder (2000, 2004) tried to overcome this problem by
imputing the unobserved components. The usefulness of
their results, however, depends on the correctness of the
imputation model. In the present paper, this problem is
circumvented by deriving a design-based estimator for
$\mathbf{C}\mathbf{V}\mathbf{C}'$, i.e., the covariance matrix of the contrasts of
$\hat{\bar{\mathbf{Y}}}_{greg}$, which is sufficient for the Wald statistic (12).

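Expressions for the generalized regression estimator are
derived first. Results for the Horvitz-Thompson estimator
are given as a special case. The covariance matrix of the
contrasts of $\hat{\bar{\mathbf{Y}}}_{greg}$ can be approximated by the covariance
matrix of the contrasts of $\hat{\bar{\mathbf{E}}}_{HT} = (\hat{\bar{E}}_{1;HT}, \ldots, \hat{\bar{E}}_{K;HT})'$. Let
$\mathrm{Cov}_s$ and $\mathrm{Cov}_e$ denote the covariances with respect to the
sample design and the experimental design respectively.
Now, consider the following variance decomposition:

$$\mathbf{C}\mathbf{V}\mathbf{C}' = \mathrm{Cov}_m E_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) + E_m\mathrm{Cov}_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) + E_m E_s\mathrm{Cov}_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s). \tag{21}$$

Since $E_e(\mathbf{p}_{ik}) = \mathbf{r}_k$ (see (42) in the appendix), it follows that

$$E_e(\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) = \frac{1}{N}\sum_{i=1}^{n_{++}}\frac{(\mathbf{y}_i - \mathbf{B}'\mathbf{x}_i)}{\pi_i}. \tag{22}$$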


Under the condition that a constant $H$-vector $\mathbf{a}$ exists such
that $\mathbf{a}'\mathbf{x}_i = 1$ for all $i \in U$, it is proven in the appendix that

$$\mathbf{C}(\mathbf{y}_i - \mathbf{B}'\mathbf{x}_i) = \mathbf{C}\boldsymbol{\varepsilon}_i. \tag{23}$$

The stated condition implicitly assumes that the size of the
finite population is known and is used as auxiliary
information. This condition holds for weighting models that contain
an intercept or one or more categorical variables that
partition the population into subpopulations. Using model
assumptions (2) and (3), it follows from (22) and (23) that


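$$\mathrm{Cov}_m E_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) = \mathrm{Cov}_m\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{C}\boldsymbol{\varepsilon}_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}', \tag{24}$$

and

$$E_m\mathrm{Cov}_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) = E_m\mathrm{Cov}_s\left(\frac{1}{N}\sum_{i=1}^{n_{++}}\frac{\mathbf{C}\boldsymbol{\varepsilon}_i}{\pi_i}\right) = \frac{1}{N^2}\sum_{i=1}^{N}\frac{\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}'}{\pi_i} - \frac{1}{N^2}\sum_{i=1}^{N}\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}'. \tag{25}$$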


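For the third term in (21), it is proven in the appendix for
an RBD that

$$E_m E_s\mathrm{Cov}_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) = E_m E_s(\mathbf{C}\mathbf{D}\mathbf{C}') - \frac{1}{N^2}\sum_{i=1}^{N}\frac{\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}'}{\pi_i}, \tag{26}$$

where $\mathbf{D}$ is a $K \times K$ diagonal matrix with diagonal elements

$$d_k = \sum_{j=1}^{J}\frac{1}{n_{jk}}\frac{1}{(n_{j+}-1)}\sum_{i=1}^{n_{j+}}\left(\frac{n_{j+}(y_{ik} - \mathbf{b}_k'\mathbf{x}_i)}{N\pi_i} - \frac{1}{n_{j+}}\sum_{i'=1}^{n_{j+}}\frac{n_{j+}(y_{i'k} - \mathbf{b}_k'\mathbf{x}_{i'})}{N\pi_{i'}}\right)^2 \equiv \sum_{j=1}^{J}\frac{S^2_{E_{jk}}}{n_{jk}}. \tag{27}$$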


If the results obtained in (24), (25) and (26) are inserted in 
(21), then it follows that 


$$\mathbf{C}\mathbf{V}\mathbf{C}' = E_m E_s\mathbf{C}\mathbf{D}\mathbf{C}'. \tag{28}$$


Conditionally on the realization of $m$ and $s$, an
approximately design-unbiased estimator for $\mathbf{D}$ in (28) can be
derived. Therefore, $\mathbf{C}\mathbf{V}\mathbf{C}'$ can conveniently be stated
implicitly as the expectation over the measurement error
model and the sampling design. See Van den Brakel (2001)
for explicit expressions for $\mathbf{C}\mathbf{V}\mathbf{C}'$. Given the realization of
$m$ and $s$, the allocation of the sampling units within each
block to the subsamples $s_{jk}$ can be considered as simple
random sampling without replacement from block $s_j$.
Consequently, for an RBD it follows that an approximately
design-unbiased estimator for $\mathbf{D}$ is given by a $K \times K$
diagonal matrix $\hat{\mathbf{D}}$ with diagonal elements

$$\hat{d}_k = \sum_{j=1}^{J}\frac{1}{n_{jk}}\frac{1}{(n_{jk}-1)}\sum_{i=1}^{n_{jk}}\left(\frac{n_{j+}(y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i)}{N\pi_i} - \frac{1}{n_{jk}}\sum_{i'=1}^{n_{jk}}\frac{n_{j+}(y_{i'k} - \hat{\mathbf{b}}_k'\mathbf{x}_{i'})}{N\pi_{i'}}\right)^2 \equiv \sum_{j=1}^{J}\frac{\hat{S}^2_{E_{jk}}}{n_{jk}}. \tag{29}$$


An approximately design-unbiased estimator for $\mathbf{C}\mathbf{V}\mathbf{C}'$ in
(28) is given by $\mathbf{C}\hat{\mathbf{D}}\mathbf{C}'$. Results for a CRD follow directly
as a special case from (27) and (29) where $J = 1$, $n_{j+} = n_+$
and $n_{jk} = n_k$. As an alternative, the residuals $(y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i)$
in (29) can be multiplied by the correction weights (so
called $g$-weights, Särndal et al. 1992, result 6.6.1). Since
$\mathbf{C}\mathbf{V}\mathbf{C}'$ in (28) is defined implicitly as the expectation over
the sampling design, (29) is approximately design-unbiased
under general complex sampling schemes. This variance 
estimator only requires that the fraction of sampling units 
assigned to the different treatments according to the 
experimental design is fixed in advance. The size of the 
sample as well as the blocks might be random with respect 
to the sample design, e.g., in the case of an RBD where 
clusters or PSU’s are the block variable. 
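
The structure of the variance estimator can be sketched as follows. The function below computes the diagonal elements $\hat{d}_k$ along the lines of (29) for an RBD: within each block and treatment cell it takes the sample variance of the rescaled residuals $n_{j+}(y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i)/(N\pi_i)$ and divides it by $n_{jk}$. It assumes that the residuals have already been computed and that every cell contains at least two units. The function name and the toy data are illustrative only.

```python
from collections import defaultdict


def d_hat(resid, pi, block, treatment, K, N):
    """Diagonal elements of D-hat for an RBD (a CRD is the case of one block).

    resid[i]     residual y_ik - b_k' x_i of unit i under its own treatment k
    pi[i]        first order inclusion probability of unit i
    block[i]     block label of unit i
    treatment[i] treatment index (0..K-1) of unit i
    """
    n_block = defaultdict(int)
    for j in block:
        n_block[j] += 1
    cells = defaultdict(list)
    for r, p, j, k in zip(resid, pi, block, treatment):
        cells[(j, k)].append(n_block[j] * r / (N * p))   # rescaled residual
    d = [0.0] * K
    for (j, k), values in cells.items():
        n_jk = len(values)                # must be at least 2
        mean = sum(values) / n_jk
        s2 = sum((v - mean) ** 2 for v in values) / (n_jk - 1)
        d[k] += s2 / n_jk
    return d


# Toy CRD (one block) in a simple random sample of n = 4 from N = 40.
resid = [1.0, 3.0, 2.0, 6.0]
pi = [0.1] * 4
block = [0, 0, 0, 0]
treatment = [0, 0, 1, 1]
d = d_hat(resid, pi, block, treatment, K=2, N=40)
print(d)
```

Note that only first order inclusion probabilities enter the computation, in line with the remark that joint inclusion probabilities are not required.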

The variance estimator $\mathbf{C}\hat{\mathbf{D}}\mathbf{C}'$ has a structure as if the $K$
subsamples had been drawn independently from each other,
where the sampling units are selected with unequal
probabilities $(\pi_i/n_+)$ with replacement in the case of a CRD, or
$(\pi_i/n_{j+})$ with replacement within each block $j$ in the case
of an RBD (compare (29) with Cochran 1977, equation
(9A.16)). It is remarkable that the second order inclusion
probabilities of the sampling design have vanished. This is
caused by:


1. The assumption of additive treatment effects in the
measurement error model, i.e., $\beta_k$ for all $i \in U$
observed under treatment $k$.


2. The assumption that measurement errors between
individuals are independent.


3. A properly chosen weighting scheme such that the
condition $\mathbf{a}'\mathbf{x}_i = 1$ for all $i \in U$ is satisfied.


4. The fact that variances are calculated for the
contrasts between the subsample means.


The design variance of the first-order Taylor series
approximation of the generalized regression estimator consists of
the residuals $(y_{ik} - \mathbf{b}_k'\mathbf{x}_i)$. From the proof of (23) it follows
that under a weighting scheme that satisfies the condition
$\mathbf{a}'\mathbf{x}_i = 1$ for all $i \in U$, the treatment effects $\beta_k$ vanish from
the residuals $(y_{ik} - \mathbf{b}_k'\mathbf{x}_i)$ in (23). In these residuals three
terms remain:


1. The residual of the linear regression model of the
intrinsic value, i.e., $e_i = u_i - \mathbf{b}'\mathbf{x}_i$.

2. A term concerning the bias due to the interviewer
effects. This term is equal to $\psi_l - \mathbf{d}'\mathbf{x}_i$, where $\mathbf{d}$
denotes the regression coefficients from the
regression function of the interviewer effects on the
auxiliary variables $\mathbf{x}_i$ (see proof of (23) in the
appendix).


3. The measurement errors $\varepsilon_{ik}$.


The residuals of the intrinsic values $e_i$ and the bias due to
the interviewer effects do not depend on the different
treatments and therefore cancel out in the contrasts of the
residuals in (23). Only the measurement errors $\boldsymbol{\varepsilon}_i$ remain in the
contrasts of the residuals in (23). As a result, the two terms
$\mathrm{Cov}_m E_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s)$ and $E_m\mathrm{Cov}_s E_e(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m, s)$
only contain the measurement errors $\boldsymbol{\varepsilon}_i$. Due to the
assumption of independence of the measurement errors 
between individuals, the cross products between individuals, 
which contain the second order inclusion probabilities in 
(24) and (25) vanish. The covariance structure of the third 
term of (21) is mainly determined by the randomization 
mechanism of the experimental design. For a CRD this 
comes down to the selection of K subsamples from s by 
means of simple random sampling without replacement. For 
an RBD this comes down to the selection of K subsamples 
from s by means of stratified simple random sampling 
without replacement where strata correspond to the blocks 
of the experiment. In the variance of the contrasts of the 
subsample means, the finite population corrections in the 
design variance of the subsample means cancel out against 
the design covariance between the subsample means. As a 
result, the leading term of (26), i.e., $E_m E_s\mathbf{C}\mathbf{D}\mathbf{C}'$, has a
structure as if the K subsamples were drawn independently 
of each other by means of simple random sampling with 
replacement in the case of a CRD, or stratified simple 
random sampling with replacement in the case of an RBD. 
Second order inclusion probabilities appear if the 
expectation with respect to the sampling design in (28) is 
made explicit, see Van den Brakel (2001). 

The minimum use of auxiliary information is a weighting
scheme where $\mathbf{x}_i = (1)$ and $\omega_i^2 = \omega^2$ for all $i \in U$. Under
this weighting scheme it follows that

$$\hat{\bar{Y}}_{k;greg} = \left(\sum_{i=1}^{n_{+k}}\frac{y_{ik}}{\pi_i^*}\right)\Bigg/\left(\sum_{i=1}^{n_{+k}}\frac{1}{\pi_i^*}\right) \equiv \tilde{y}_k, \tag{30}$$

which can be recognized as the ratio estimator for a
population mean, originally proposed by Hájek (1971). It
also follows that $\hat{\mathbf{b}}_k = (\tilde{y}_k)$ and that an approximately
design-unbiased estimator for the covariance matrix of the
treatment effects is given by (29) with $\hat{\mathbf{b}}_k'\mathbf{x}_i = \tilde{y}_k$.

If $\sum_{i=1}^{n_{+k}} 1/\pi_i^* \equiv \hat{N} = N$, then the ratio estimator (30)
corresponds with the regular Horvitz-Thompson estimator.
This condition is satisfied in the case of a CRD or an RBD
embedded in a simple random sampling design, an RBD
embedded in a stratified simple random sampling design
where strata are used as block variables or a CRD
embedded in a stratified simple random sampling design
with proportional allocation. Under the condition $\hat{N} = N$,
expressions for the design variance of the
Horvitz-Thompson estimator are given by (27) and (29), where
$y_{ik} - \mathbf{b}_k'\mathbf{x}_i$ and $y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i$ are replaced by $y_{ik}$. Variance
expressions for the Horvitz-Thompson estimator are more
complicated if $\hat{N} \neq N$, see Van den Brakel (2001).


2.5 The Wald Test 


Inserting the design-unbiased estimators for the 
subsample means and the covariance matrix of the contrasts 
between these subsample means into (12) leads to the 
design-based Wald statistic 


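$$W = \hat{\bar{\mathbf{Y}}}_{greg}'\mathbf{C}'(\mathbf{C}\hat{\mathbf{D}}\mathbf{C}')^{-1}\mathbf{C}\hat{\bar{\mathbf{Y}}}_{greg}. \tag{31}$$

It is proven in the appendix that this expression can be
simplified to:

$$W = \sum_{k=1}^{K}\frac{\hat{\bar{Y}}_{k;greg}^2}{\hat{d}_k} - \left(\sum_{k=1}^{K}\frac{1}{\hat{d}_k}\right)^{-1}\left(\sum_{k=1}^{K}\frac{\hat{\bar{Y}}_{k;greg}}{\hat{d}_k}\right)^2. \tag{32}$$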

For general sampling schemes, the asymptotic distribu- 
tion of this test statistic will be unknown. However, if the 
sampling design is simple random sampling without 
replacement and the experimental design is a CRD, then 
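Lehmann (1975, appendix 8), based on the work of Hájek
(1960), gives sufficient conditions under which $\hat{\bar{\mathbf{E}}}_{HT}$ is
asymptotically multivariate normal distributed with mean
$E_sE_e(\hat{\bar{\mathbf{E}}}_{HT} \mid m, s) = \bar{\mathbf{E}}$ and covariance matrix $\mathbf{V}$ with respect
to the sampling design and the experimental design, i.e., if
$n_{+k} \to \infty$ and $(N - n_{++}) \to \infty$, then
$(\hat{\bar{\mathbf{E}}}_{HT} \mid m) \to N(\bar{\mathbf{E}}, \mathbf{V})$. Hence, $(\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \mid m) \to N(\mathbf{C}\bar{\mathbf{E}}, \mathbf{C}\mathbf{V}\mathbf{C}')$,
with $\mathbf{C}\bar{\mathbf{E}} = (1/N)\sum_{i=1}^{N}\mathbf{C}\boldsymbol{\varepsilon}_i$. Since the $\mathbf{C}\boldsymbol{\varepsilon}_i$
are mutually independent random variables with means
equal to zero and covariance matrix $\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}'$ we have by
the ordinary central limit theorem
$(\mathbf{C}\bar{\mathbf{E}}) \to N(\mathbf{0}, (1/N^2)\sum_{i=1}^{N}\mathbf{C}\boldsymbol{\Sigma}_i\mathbf{C}')$. Combining both limit
distributions we obtain that unconditionally
$\mathbf{C}\hat{\bar{\mathbf{E}}}_{HT} \to N(\mathbf{0}, \mathbf{C}\mathbf{V}\mathbf{C}')$ and thus $\mathbf{C}\hat{\bar{\mathbf{Y}}}_{greg} \to N(\mathbf{C}\boldsymbol{\beta}, \mathbf{C}\mathbf{V}\mathbf{C}')$. As a
result it follows under the null hypothesis that $W$ is
asymptotically chi-squared distributed with $K-1$ degrees
of freedom (Searle 1971, theorem 2, chapter 2). For more
complex sampling designs it is usually conjectured that
$\mathbf{C}\hat{\bar{\mathbf{Y}}}_{greg} \to N(\mathbf{C}\boldsymbol{\beta}, \mathbf{C}\mathbf{V}\mathbf{C}')$. Then $W$ is still asymptotically
chi-squared distributed with $K-1$ degrees of freedom. The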
validity of this conjecture has been confirmed by simulation 
studies, see section 4 and Van den Brakel (2001). 


2.6 Pooled Variance Estimators 


In the case of an RBD the $n_{++}$ sampling units of $s$ are
divided into $JK$ groups of size $n_{jk}$. For each of these $JK$
subsamples separate population variances $S^2_{E_{jk}}$ have to be
estimated. If the number of experimental units $n_{jk}$ available
for the estimation of these population variances becomes too 
small, then these estimates might become unstable. In such 
situations, more stable estimates can be obtained by pooling 
estimates of the population variances within the blocks. 

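The residuals of the generalized regression estimator,
$(y_{ik} - \mathbf{b}_k'\mathbf{x}_i)$, only depend on the $k$th treatment effect
through the measurement errors $\varepsilon_{ik}$. Under the assumption
that $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$ in (3) for all $i \in U$, it follows that the $S^2_{E_{jk}}$
within each block are identical parameters, i.e.,
$S^2_{E_{j1}} = \ldots = S^2_{E_{jK}} = S^2_{E_j}$, for $j = 1, 2, \ldots, J$. Under this
assumption it is efficient to use a pooled estimator for $S^2_{E_j}$:

$$\hat{S}^2_{E_j} = \frac{1}{(n_{j+}-1)}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}}\left(\frac{n_{j+}(y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i)}{N\pi_i} - \frac{1}{n_{j+}}\sum_{k'=1}^{K}\sum_{i'=1}^{n_{jk'}}\frac{n_{j+}(y_{i'k'} - \hat{\mathbf{b}}_{k'}'\mathbf{x}_{i'})}{N\pi_{i'}}\right)^2 \tag{33}$$

or alternatively

$$\hat{S}^2_{E_j} = \frac{1}{(n_{j+}-K)}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}}\left(\frac{n_{j+}(y_{ik} - \hat{\mathbf{b}}_k'\mathbf{x}_i)}{N\pi_i} - \frac{1}{n_{jk}}\sum_{i'=1}^{n_{jk}}\frac{n_{j+}(y_{i'k} - \hat{\mathbf{b}}_k'\mathbf{x}_{i'})}{N\pi_{i'}}\right)^2. \tag{34}$$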

There are several special cases where the design-based 
Wald statistic coincides with the F-statistics known from 
more standard model-based analysis procedures. Consider 
an RBD embedded in a self-weighted sampling design 
where sampling units are allocated proportionally to the 
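treatments over the blocks, i.e., $\pi_i = n_{++}/N$ and
$n_{jk}/n_{j+} = n_{+k}/n_{++}$ for all $j = 1, \ldots, J$. Then, it follows
from the results obtained for the ratio estimator (30) that
$\hat{\bar{Y}}_{k;greg} = (1/n_{+k})\sum_{j=1}^{J}\sum_{i=1}^{n_{jk}} y_{ik} \equiv \bar{y}_{+k}$ and $\hat{\mathbf{b}}_k'\mathbf{x}_i = \bar{y}_{+k}$. Denote
$\bar{y}_{j+} = (1/n_{j+})\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}} y_{ik}$ and $\bar{y}_{++} = (1/n_{++})\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}} y_{ik}$, then it
follows that

$$\frac{1}{n_{j+}}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}}\bar{y}_{+k} = \frac{1}{n_{j+}}\sum_{k=1}^{K} n_{jk}\bar{y}_{+k} = \frac{1}{n_{++}}\sum_{k=1}^{K} n_{+k}\bar{y}_{+k} = \bar{y}_{++}.$$

If $n_{j+} \approx n_{j+} - 1$, then it follows under the pooled variance
estimator (33) that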


y 
Vik —b,X; OER y Si Vik eLiye 
Denote $\bar{y}_{jk} = (1/n_{jk})\sum_{i=1}^{n_{jk}} y_{ik}$. Under the pooled variance


estimator (34) it follows that

Substituting these pooled variance estimators into the Wald
statistic (32), leads to

$$W = \frac{1}{\hat{d}_{p_a}}\left(\sum_{k=1}^{K} n_{+k}\bar{y}_{+k}^2 - n_{++}\bar{y}_{++}^2\right), \tag{37}$$

where $\hat{d}_{p_a}$ is given by either (35) for $a = 1$ or (36) for
$a = 2$. It can be recognized that $W/(K-1)$ in (37) with $\hat{d}_{p_1}$
the pooled variance estimator (35), corresponds with the
F-statistic of an ANOVA for a two-way layout without
interactions. If $\hat{d}_{p_2}$ (36) is inserted, then $W/(K-1)$
corresponds with the F-statistic of an ANOVA for a
two-way layout with interactions (Scheffé 1959, chapter 4).
A pooled variance estimator for a CRD follows as a special
case from (35) and (36). Under both estimators it follows
that $W/(K-1)$ corresponds with the F-statistic of the
one-way ANOVA (Scheffé 1959, chapter 3).
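
The correspondence with the one-way ANOVA can be illustrated for the simplest case: a CRD embedded in a simple random sample, with the minimal weighting scheme $\mathbf{x}_i = (1)$. Under these assumptions the rescaled residuals reduce to $y_{ik} - \bar{y}_k$, the pooled variance of type (34) with $J = 1$ is the within-group mean square, and $\hat{d}_k$ is that mean square divided by $n_k$. The sketch below compares $W/(K-1)$, computed from the simplified Wald statistic (32), with the classical F-statistic. The data and function names are hypothetical.

```python
def one_way_f(groups):
    """Classical one-way ANOVA F-statistic."""
    n = sum(len(g) for g in groups)
    K = len(groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (K - 1)) / (ss_within / (n - K))


def wald_pooled(groups):
    """Wald statistic (32) with d_k = pooled variance / n_k (CRD, one block)."""
    n = sum(len(g) for g in groups)
    K = len(groups)
    means = [sum(g) / len(g) for g in groups]
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    pooled = ss_within / (n - K)
    d = [pooled / len(g) for g in groups]
    s_yy = sum(m * m / dk for m, dk in zip(means, d))
    s_y = sum(m / dk for m, dk in zip(means, d))
    s_1 = sum(1.0 / dk for dk in d)
    return s_yy - s_y * s_y / s_1


# Hypothetical responses under K = 3 treatments.
groups = [[4.1, 5.0, 4.6], [5.2, 5.9, 6.1, 5.5], [3.9, 4.4, 4.0]]
K = len(groups)
f_stat = one_way_f(groups)
w_over_df = wald_pooled(groups) / (K - 1)
print(f_stat, w_over_df)
```

In this special case the two quantities agree up to rounding, whereas under unequal inclusion probabilities or a richer weighting model the design-based statistic generally departs from the model-based F-statistic.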


2.7 Advantages of RBD’s 


The main advantage of RBD’s is the elimination of the 
variation between the blocks in the analysis of treatment 
effects. Sampling units from the same stratum, PSU or 
cluster generally have a higher degree of homogeneity 
compared with sampling units from different strata, PSU’s 
or clusters. This suggests using sampling structures like 
strata, PSU’s or clusters as block variables in an RBD 
(Fienberg and Tanur 1987, 1988). Using these sampling 
structures as a block variable in an RBD, ensures that each 
stratum, PSU or cluster is sufficiently represented within
each subsample. Also interviewers are potential block
variables, since this eliminates the variation in the observa- 
tions due to fixed or random interviewer effects specified in 
measurement error model (1). For surveys where
interviewers collect data by means of CAPI in separate
geographical areas, blocking on interviewers also eliminates
this regional variation from the target variable. The power of
an experiment is maximized if sampling units are allocated
proportionally to treatments over the blocks, i.e.,
$n_{jk}/n_{j+} = n_{+k}/n_{++}$ for all $j = 1, \ldots, J$ (see Van den Brakel
2001, chapter 6). This allocation is better preserved if 
interviewers are used as the block variables, since response 
rates between interviewers differ substantially. Unrestricted 
randomization by means of a CRD is not always feasible 
from a practical point of view. For example in CAPI 
surveys where interviewers collect data in geographical 
areas surrounding their places of residence, restricted 
randomization of sampling units within interviewers or 
geographical regions which are unions of adjacent inter- 
viewer regions might be required to avoid an unacceptable 
increase in the travel distance of the interviewers. This 
naturally leads to RBD’s with interviewers or regions as 
block variables. 


3. Design-Based Linear Regression Analysis 


A design-based linear regression might be considered as 
an alternative for the analysis of embedded experiments. 
The observations are assumed to be the outcome of a linear
regression model $y_i = \boldsymbol{\beta}'\mathbf{x}_i + e_i$, with $\mathbf{x}_i$ the vector
containing $Q$ explanatory variables, $\boldsymbol{\beta}$ the vector containing
the regression coefficients, and $e_i$ a residual. This model is
mainly determined by the experimental design and contains 
the treatment factors, local control factors (e.g., blocks) and 
covariates as explanatory variables (see e.g., Montgomery 
2001). Potential covariates are the auxiliary variables in the 
weighting scheme of the generalized regression estimator. 
The parameters of interest are the regression coefficients in 
the finite population, which are defined by
$\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, where $\mathbf{X}$ is the $N \times Q$ design matrix of the
experimental design, and $\mathbf{y}$ an $N$ vector containing the
observations obtained under the different treatments, as if 
the entire finite population is included in the experiment. 
The design matrix conceptually divides the population into 
K subpopulations or domains, which are observed under 
each of the K treatments of the experiment. The size of each 
subpopulation is determined by the fraction of sampling 
units assigned to each treatment in the experiment. A 
design-based estimator for the regression coefficients is 
given by B =(X‘MI'X,) 1 X‘I'y,,, (Sarndal et al. 1992, 
section 5.10). Here X, is the nxXQ design matrix, y, a 
vector containing n observations obtained under the 
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different treatments of the n units included in the sample, 
and II a nxn diagonal matrix containing the first order 
inclusion probabilities m, of the sampling design. The 
approximate covariance matrix of B, is given by (Sarndal 
et al. 1992, section 5.10) 


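$$\mathrm{Var}(\hat{B}) = (X^t X)^{-1} A (X^t X)^{-1}, \qquad (38)$$

with $A = \mathrm{Var}_s(X_s^t \Pi^{-1} y_s - X_s^t \Pi^{-1} X_s B)$. The elements of
A are given by

$$a_{qq'} = \sum_{i=1}^{N} \sum_{i'=1}^{N} (\pi_{ii'} - \pi_i \pi_{i'}) \frac{x_{iq} e_i}{\pi_i} \frac{x_{i'q'} e_{i'}}{\pi_{i'}},$$

with $e_i = y_i - B^t x_i$ and $\pi_{ii'}$ the second order inclusion
probabilities of the sampling design. Hypotheses about the subset of
regression coefficients that reflect the treatment effects
are tested with a Wald test, see e.g., Skinner (1989).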
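
The point estimator above is a weighted least squares fit with design weights $1/\pi_i$. The following is a minimal sketch of that computation in plain Python, not the implementation used in the paper. The function names, the toy sample and its inclusion probabilities are hypothetical and serve only as an illustration.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_based_coefficients(x_rows, y, incl_probs):
    """Return (X_s' Pi^-1 X_s)^-1 X_s' Pi^-1 y_s for a sample."""
    q = len(x_rows[0])
    xtx = [[0.0] * q for _ in range(q)]
    xty = [0.0] * q
    for x, yi, pi in zip(x_rows, y, incl_probs):
        w = 1.0 / pi  # design weight of the unit
        for a in range(q):
            xty[a] += w * x[a] * yi
            for b in range(q):
                xtx[a][b] += w * x[a] * x[b]
    return solve_linear_system(xtx, xty)


# Hypothetical toy sample: two treatments coded as indicator columns.
x_rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [2.0, 4.0, 10.0, 20.0]
incl_probs = [0.5, 0.25, 0.5, 0.5]
coefficients = design_based_coefficients(x_rows, y, incl_probs)
print(coefficients)
```

With indicator columns only, each coefficient reduces to the weighted mean of the observations under that treatment, which is the sense in which the regression approach treats the subsamples as domains.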

The major drawback of this approach is that the
estimation procedure does not account for the random
assignment of sampling units to treatments according to the
experimental design. In doing so the subsample estimates
are erroneously treated as if they were domain estimates,
which results in wrong design-variances. The covariance
matrix of the treatment effects (28), derived in section 2.4,
illustrates that the superimposition of the experimental
design on the sampling design determines which specific
features of the sampling design are nullified or preserved.
For example, the effect of stratified sampling or two-stage
sampling on the variance of the treatment effects is nullified
under a CRD. This effect, however, is ignored by the linear
regression approach, since $\mathrm{Var}(\hat{B})$ only accounts for the
variance of the sample design. Disregarding the experimental
design in the variance estimation procedure becomes
even more obvious under a complete enumeration of the
finite population. Due to the experimental design, the entire
finite population is randomly divided into K subsamples and
the parameters under the different treatments are still
estimated with a nonzero design variance. In this situation it
follows for the linear regression approach that $\hat{B} = B$ and
that $\mathrm{Var}(\hat{B})$ is equal to zero because the design-variance
induced by the experimental design is ignored. This
contrasts with (28), which under a complete enumeration still
reflects the design-variance due to the experimental design.

It is not immediately evident how the linear regression
approach can be adjusted to allow for the randomization due
to the sampling design as well as the experimental design.
Conditionally on the realization of the sample, the
experimental design can be described by first and second
order inclusion probabilities. Let $\pi_i^{k}$ denote the first order
inclusion probability that the i-th sampling unit is assigned
to the k-th treatment and let $\pi_{ii'}^{kk'}$ denote the second order
inclusion probability that the i-th sampling unit is assigned to
the k-th treatment and the i'-th sampling unit is assigned to
the k'-th treatment. A design-based estimator for B that
accounts for the sampling design and the experimental
design is given by $\hat{B} = (X_s^t \Pi^{*-1} X_s)^{-1} X_s^t \Pi^{*-1} y_s$, where
$\Pi^{*}$ denotes the n×n diagonal matrix with first order
inclusion probabilities $\pi_i^{*} = \pi_i \pi_i^{k}$. An approximation for
the covariance matrix of $\hat{B}$ is given by (38), where A is
obtained by conditioning on the realization of the sample,
i.e.,

$$A = \mathrm{Var}_s E_e (X_s^t \Pi^{*-1} y_s - X_s^t \Pi^{*-1} X_s B) + E_s \mathrm{Var}_e (X_s^t \Pi^{*-1} y_s - X_s^t \Pi^{*-1} X_s B),$$

where the subscripts s and e refer to the sampling design and the
experimental design, respectively. This leads to an expression for
the elements of A which has the variance structure of a two-phase sample,
where the first phase corresponds to the sampling design
and the second phase to the experimental design. The
sampling units are, according to the experimental design,
assigned to only one of the K treatments. As a result it
follows that $\pi_{ii'}^{kk'} = 0$ for $k \neq k'$ and $i = i'$, which hampers
the derivation of an approximately design-unbiased
estimator for the covariance terms of $\mathrm{Var}(\hat{B})$, see also
Van den Brakel and Binder (2000, 2004). In the analysis
procedure proposed in section 2, this problem is
circumvented by deriving a design-based estimator for the
covariance matrix of the contrasts $C\hat{Y}_{GREG}$ instead of an
estimator for the covariance matrix of $\hat{Y}_{GREG}$ itself.


4. Simulation Study 


In subsection 4.1, a simulation study is conducted to
evaluate the performance of the design-based estimator for
the covariance matrix of the contrasts between the
subsample estimates $C\hat{D}C^t$ with diagonal elements (29) as
well as the design-based Wald statistic W defined by (32) to
test hypotheses about these contrasts. Subsequently, this
design-based Wald test, the design-based linear regression
approach and a standard ANOVA are applied to the analysis
of a CRD and an RBD in subsection 4.2.


4.1 Evaluation of the Unbiasedness of $C\hat{D}C^t$ and the
Distribution of W


In this simulation study, a measurement error model
without interviewer effects is assumed, i.e.,

$$y_{ik} = u_i + \beta_k + \varepsilon_{ik}. \qquad (39)$$

An artificial population consisting of 3 strata, 450 PSU’s
and 109,500 SSU’s is generated by randomly drawing
strictly positive values for the intrinsic values $u_i$ of a target
parameter. The sizes of the PSU’s in the population are
unequal. The intrinsic values are generated in two steps.
First, a positive value for each PSU in the population is
drawn from a uniform distribution. Subsequently a positive
value for each SSU, also drawn from a uniform distribution,
is added to the value obtained for the PSU in the first step.
Within each stratum different lower and upper boundaries
and interval-widths for these uniform distributions are
applied, such that the population can be stratified into three
relatively homogeneous subpopulations. The intervals of the
uniform distributions that are applied in the second step are
smaller than the intervals of the uniform distributions in the
first step. This resulted in a population where the intrinsic
values for the SSU’s within each PSU are clustered. The
structure of the population is summarized in Table 1.


Table 1
Population
Stratum  Number of  Number of  Mean    Std. dev.  Min. value  Max. value
         PSU’s      SSU’s
1        70         6,250      22,183  12,001     7,607       50,915
2        130        18,250     6,128   1,866      3,007       10,490
3        250        85,000     1,407   732        512         3,248
Total    450        109,500    3,380   5,803      512         50,915
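
The two-step generation of the intrinsic values can be sketched as follows. This is a rough illustration under stated assumptions, not the generator used for the study: the PSU and SSU counts follow Table 1, but the uniform bounds, the rule for splitting SSU’s over PSU’s, the seed and all function names are hypothetical, since the paper does not report them.

```python
import random


def split_sizes(total, parts, rng):
    """Split `total` units into `parts` unequal positive cluster sizes."""
    weights = [rng.uniform(0.5, 1.5) for _ in range(parts)]
    scale = (total - parts) / sum(weights)
    sizes = [1 + int(w * scale) for w in weights]
    sizes[0] += total - sum(sizes)  # put the rounding remainder in one PSU
    return sizes


def generate_population(strata, seed=1):
    """Two-step draw: a PSU-level value plus a narrower SSU-level value."""
    rng = random.Random(seed)
    population = []  # tuples (stratum, psu, intrinsic value)
    for h, (n_psu, n_ssu, psu_range, ssu_range) in enumerate(strata, start=1):
        for psu, size in enumerate(split_sizes(n_ssu, n_psu, rng), start=1):
            psu_value = rng.uniform(*psu_range)        # step 1: PSU level
            for _ in range(size):
                ssu_value = rng.uniform(*ssu_range)    # step 2: SSU level
                population.append((h, psu, psu_value + ssu_value))
    return population


# PSU and SSU counts follow Table 1; the uniform bounds are hypothetical.
STRATA = [
    (70, 6250, (7000.0, 45000.0), (500.0, 5000.0)),
    (130, 18250, (3000.0, 9000.0), (5.0, 1500.0)),
    (250, 85000, (500.0, 2500.0), (10.0, 750.0)),
]
population = generate_population(STRATA)
print(len(population))
```

Because the SSU-level interval is narrower than the PSU-level interval, values within a PSU lie close together, which produces the clustering described above.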


Samples are drawn repeatedly from this population by 
means of stratified two-stage sampling without replacement 
with unequal inclusion probabilities. The inclusion proba- 
bilities are chosen proportionally to the size of the target 
parameter. The sample sizes for the different strata are 
summarized in Table 2. For each sample, a new measure- 
ment error is generated for each population element. These 
measurement errors are drawn from a normal distribution 
with a mean equal to zero and a standard deviation pro- 
portional to the size of the intrinsic values. The range of the 
standard deviations varied from 1,000 for the SSU’s with 
the largest intrinsic values in the first stratum to 10 for the 
SSU’s with the smallest intrinsic values in the third stratum. 


Table 2
Sample Design
Stratum  Number of PSU’s  Number of SSU’s
1        25               900
2        30               1,080
3        50               1,800
Total    105              3,780


Finally, the samples are randomly divided into four 
subsamples according to an experimental design, each with 
a size of 945 SSU’s. Two different experimental designs are 
applied. In the first design, the SSU’s are randomized over 
the four different treatments according to a CRD. In the 
second design, the SSU’s are randomized over the four 
different treatments according to an RBD, where the three 
strata are used as the block variable. Within each block or 
stratum, 1/4 of the SSU’s are randomly assigned to each 
treatment. Under both experimental designs, four different 
sets of treatment effects are applied, one under the null
hypothesis and three under different alternative hypotheses. 
This resulted in eight different simulations, which are 
specified in Table 3. Each simulation is based on 
R=100,000 resamples. Observations for the target para- 
meter are obtained by adding a measurement error and a 
treatment effect to the intrinsic values according to (39). 


Table 3
Summary of Simulation Settings
Experimental design  Treatment effects
                     β_1  β_2  β_3  β_4
CRD, RBD             0    0    0    0
CRD, RBD             0    20   40   60
CRD, RBD             0    40   80   120
CRD, RBD             0    80   160  240
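
The difference between the two randomization schemes can be sketched in a few lines. The block below is an illustration only: the stratum sample sizes are taken from Table 2, while the unit labels, the seed and the function names are hypothetical.

```python
import random
from collections import Counter


def assign_crd(units, n_treatments, rng):
    """CRD: shuffle all units and deal them out in equal shares."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {u: i % n_treatments + 1 for i, u in enumerate(shuffled)}


def assign_rbd(units_by_block, n_treatments, rng):
    """RBD: repeat the CRD randomization separately within each block."""
    assignment = {}
    for block_units in units_by_block.values():
        assignment.update(assign_crd(block_units, n_treatments, rng))
    return assignment


rng = random.Random(2005)
# Sample sizes per stratum as in Table 2; unit labels are hypothetical.
sample_sizes = {1: 900, 2: 1080, 3: 1800}
units_by_block = {h: [(h, i) for i in range(n)] for h, n in sample_sizes.items()}
all_units = [u for block in units_by_block.values() for u in block]

crd = assign_crd(all_units, 4, rng)
rbd = assign_rbd(units_by_block, 4, rng)

crd_sizes = Counter(crd.values())
rbd_sizes = Counter((unit[0], k) for unit, k in rbd.items())
print(crd_sizes)
print(rbd_sizes)
```

Both schemes give four subsamples of 945 SSU’s, but only the RBD fixes the number of SSU’s per stratum within each subsample.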


The data obtained in each resample are analyzed with the
extended Horvitz-Thompson estimator (30). Let $\hat{Y}_k^r$ denote
the subsample estimate obtained under the k-th treatment in
the r-th resample. The vector with the four subsample
estimates obtained in the r-th resample is denoted by
$\hat{Y}^r = (\hat{Y}_1^r, \hat{Y}_2^r, \hat{Y}_3^r, \hat{Y}_4^r)^t$. The vector with the three contrasts in the
r-th resample is equal to $C\hat{Y}^r$, with $C = (j \mid -I)$, j a vector
of order 3 with each element equal to one, and I the 3×3
identity matrix. Furthermore, $\hat{d}_k^r$ denotes the diagonal
elements of the estimated covariance matrix, obtained under
the r-th resample. An expression for $\hat{d}_k^r$ is given by (29)
with $\hat{b}_k^t x_i = \hat{Y}_k^r$. The estimated covariance matrix of the
treatment effects is equal to $C\hat{D}^r C^t$, with
$\hat{D}^r = \mathrm{diag}(\hat{d}_1^r, \hat{d}_2^r, \hat{d}_3^r, \hat{d}_4^r)$. Finally
$W^r = (C\hat{Y}^r)^t (C\hat{D}^r C^t)^{-1} (C\hat{Y}^r)$ denotes the Wald statistic
observed in the r-th resample. Based on the R = 100,000 resamples
within each simulation, the population parameters under the different
treatments can be approximated by

$$\bar{Y} = \frac{1}{R} \sum_{r=1}^{R} \hat{Y}^r,$$

with $\bar{Y} = (\bar{Y}_1, \bar{Y}_2, \bar{Y}_3, \bar{Y}_4)^t$. From (10) it follows that the real
treatment effects in the measurement error model can be
approximated by $C\bar{Y} \approx C\beta$. Furthermore, the mean of the
estimated resample covariance matrices can be calculated as

$$C\bar{D}C^t = \frac{1}{R} \sum_{r=1}^{R} C\hat{D}^r C^t,$$

and the mean of the resample Wald statistics as

$$\bar{W} = \frac{1}{R} \sum_{r=1}^{R} W^r. \qquad (40)$$

An approximation of the real covariance matrix of the
treatment effects is given by

$$CVC^t = \frac{1}{R-1} \sum_{r=1}^{R} C(\hat{Y}^r - \bar{Y})(\hat{Y}^r - \bar{Y})^t C^t. \qquad (41)$$

The performance of the variance estimation procedure is
evaluated by comparing $C\bar{D}C^t$ to $CVC^t$. If the derived
variance estimator $C\hat{D}C^t$ is approximately design-unbiased,
then the mean of the resample covariance matrices
$C\bar{D}C^t$ must tend to the real covariance matrix $CVC^t$ for
$R \to \infty$. An impression of the precision of the derived
variance estimator is obtained by calculating the standard
deviation of the elements of $C\hat{D}C^t$, which is denoted by
$\sigma(C\hat{D}C^t)$. The diagonal elements of $\bar{D}$ are denoted $\bar{d}_k$.

If $C\hat{Y}_{GREG} \sim N(C\beta, CVC^t)$, then it follows that
$W \to \chi^2_{[K-1][\delta]}$, with K−1 the number of degrees of
freedom and $\delta = \frac{1}{2}(C\beta)^t (CVC^t)^{-1} (C\beta)$ the non-centrality
parameter of the chi-squared distribution. In the
simulation study, the non-centrality parameter under the
alternative hypotheses can be calculated by inserting (41) in
the expression of $\delta$. Subsequently, the power of the Wald
statistic for a particular set of treatment effects can be
calculated by $P(W) = P(\chi^2_{[K-1][\delta]} > \chi^2_{[1-\alpha][K-1]})$, where
$\chi^2_{[1-\alpha][K-1]}$ denotes the (1−α)-th percentile point of the
central chi-squared distribution with K−1 degrees of
freedom. The performance of the Wald statistic is evaluated
by comparing P(W) with the simulated power, which is
defined as the fraction of significant runs observed in the R
resamples, i.e.,

$$P^{sim}(W) = \frac{1}{R} \sum_{r=1}^{R} I(W^r > \chi^2_{[1-\alpha][K-1]}),$$

where I(B) denotes the indicator variable which is equal to
one if B is true, and equal to zero otherwise. The results of
the simulations are summarized in Tables 4.1 through 4.8.
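
The power P(W) can be evaluated numerically by writing the non-central chi-squared distribution as a Poisson mixture of central chi-squared distributions. The sketch below assumes the convention implied by the expected value (K−1) + 2δ, so that the Poisson mean equals δ. The function names and the hard-coded percentile points for 3 degrees of freedom are choices made for this illustration and do not come from the paper.

```python
import math

# Upper percentile points of the central chi-squared distribution, 3 df.
CRITICAL_VALUES_3DF = {0.050: 7.814728, 0.025: 9.348404, 0.010: 11.344867}


def chi2_sf(x, df):
    """Survival function of the central chi-squared distribution."""
    a, half_x = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function.
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower


def wald_power(delta, alpha, df=3, critical_values=CRITICAL_VALUES_3DF):
    """P(non-central chi2 > critical value), where E(chi2) = df + 2 * delta."""
    crit = critical_values[alpha]
    power, j = 0.0, 0
    weight = math.exp(-delta)  # Poisson(delta) probability of j = 0
    while j < 200:
        power += weight * chi2_sf(crit, df + 2 * j)
        j += 1
        weight *= delta / j
    return power


power_null = wald_power(0.0, 0.050)
power_alt = wald_power(1.1203, 0.050)  # delta as reported for Table 4.4
print(round(power_null, 5), round(power_alt, 4))
```

Under the null hypothesis the function returns the nominal level, and for δ = 1.1203 it should land close to the value of P(W) tabulated for α = 0.050 in Table 4.4.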

The means of the subsample estimates $\bar{Y}_k$ under the null
hypotheses in Tables 4.1 and 4.5 slightly overestimate the
population mean in Table 1. This difference can be
attributed to the bias of the extended Horvitz-Thompson
estimator. The means of the contrasts between the
subsample estimates $C\bar{Y}$, however, almost perfectly agree
with the real treatment effects $C\beta$. The means of the
resample covariance matrices $C\bar{D}C^t$ tend to the values of
the real covariance matrices $CVC^t$, which illustrates that
the variance estimation procedure, derived in section 2.4, is
approximately design-unbiased. The relative precision of the
diagonal elements of $C\hat{D}C^t$ is about 10.5% under this
particular sample size. The simulated power based on the
resample distribution of the Wald statistic approximates the
real power reasonably well. On average the simulated
power is slightly higher. The expected value of the chi-squared
distribution is equal to $E(\chi^2_{[K-1][\delta]}) = (K-1) + 2\delta$
(Searle 1971, section 2.4.h). If the resample distribution of
the Wald statistic tends to a $\chi^2_{[K-1][\delta]}$ distribution, then the mean of the
resample Wald statistics $\bar{W}$ (40) must tend to the expected
value of the chi-squared distribution. Indeed, it follows from
Tables 4.1 through 4.8 that $\bar{W} \approx (K-1) + 2\delta$. Moreover, the
hypothesis that the resample distribution of the Wald
statistic under the null hypothesis is equal to the central chi-squared
distribution is tested with the one-sample
Kolmogorov-Smirnov test. This hypothesis is not rejected at
a significance level of 5% for either the CRD or the RBD,
which confirms the conjecture that the Wald statistic is
asymptotically chi-squared distributed under stratified two-stage
sampling without replacement, unequal inclusion
probabilities, and relatively large sampling fractions. If the
simulations under a CRD are compared to an RBD, then it
follows that blocking on strata results in a substantial
increase of the precision of the estimated contrasts and the
power of the tests in this particular situation.
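
As a quick arithmetic check of the relation between the mean Wald statistic and (K−1) + 2δ, the tabulated values of δ can be inserted for K = 4. The lines below use only figures reported in the simulation tables for five of the alternative settings and are meant as an illustration, not as an additional result.

```latex
\begin{align*}
\text{Table 4.2 (CRD):}\quad & 3 + 2(0.0697) = 3.1394 && \text{versus } \bar{W} = 3.14037,\\
\text{Table 4.3 (CRD):}\quad & 3 + 2(0.2783) = 3.5566 && \text{versus } \bar{W} = 3.55406,\\
\text{Table 4.4 (CRD):}\quad & 3 + 2(1.1203) = 5.2406 && \text{versus } \bar{W} = 5.22065,\\
\text{Table 4.6 (RBD):}\quad & 3 + 2(0.3226) = 3.6452 && \text{versus } \bar{W} = 3.66771,\\
\text{Table 4.7 (RBD):}\quad & 3 + 2(1.2905) = 5.5810 && \text{versus } \bar{W} = 5.62182.
\end{align*}
```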


Table 4.1
Simulation Results CRD, β = (0, 0, 0, 0)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,392  14,311      1-2   0     28,725  28,616  3,019          0.050  0.05000  0.05072
2  0    3,392  14,305      1-3   0     28,892  28,616  3,019          0.025  0.02500  0.02506
3  0    3,392  14,306      1-4   2     28,787  28,603  3,019          0.010  0.01000  0.01017
4  0    3,390  14,292
W̄: 3.01591   δ: 0.0000

Table 4.2
Simulation Results CRD, β = (0, 20, 40, 60)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,392  14,307      1-2   -20   28,635  28,614  3,026          0.050  0.05842  0.05925
2  20   3,412  14,307      1-3   -40   28,918  28,620  3,033          0.025  0.03008  0.03040
3  40   3,432  14,314      1-4   -58   28,624  28,597  3,025          0.010  0.01257  0.01255
4  60   3,450  14,291
W̄: 3.14037   δ: 0.0697

Table 4.3
Simulation Results CRD, β = (0, 40, 80, 120)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,392  14,314      1-2   -40   28,597  28,621  3,020          0.050  0.08503  0.08523
2  40   3,432  14,307      1-3   -80   28,947  28,622  3,022          0.025  0.04704  0.04760
3  80   3,472  14,307      1-4   -119  28,713  28,609  3,021          0.010  0.02150  0.02165
4  120  3,511  14,295
W̄: 3.55406   δ: 0.2783

Table 4.4
Simulation Results CRD, β = (0, 80, 160, 240)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,392  14,306      1-2   -80   28,748  28,616  3,026          0.050  0.21198  0.2116
2  80   3,472  14,310      1-3   -160  28,784  28,618  3,030          0.025  0.13809  0.13885
3  160  3,552  14,312      1-4   -239  28,538  28,598  3,022          0.010  0.07703  0.07781
4  240  3,631  14,291
W̄: 5.22065   δ: 1.1203

Table 4.5
Simulation Results RBD, β = (0, 0, 0, 0)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,389  3,088       1-2   0     6,175   6,176   647            0.050  0.05000  0.05168
2  0    3,389  3,088       1-3   0     6,216   6,176   647            0.025  0.02500  0.02640
3  0    3,389  3,088       1-4   0     6,217   6,176   647            0.010  0.01000  0.01060
4  0    3,389  3,088
W̄: 3.01483   δ: 0.0000


4.2 Comparison of Three Analysis Procedures


Furthermore, three possible analysis procedures for
embedded experiments are compared, i.e., the design-based
Wald test proposed in section 2, a standard ANOVA where
all observations are equally weighted and assumed to be
i.i.d., and the design-based linear regression approach
described in section 3. To this end two samples, each with a
size of 3,780 SSU’s, are drawn from the finite population
specified in Table 1, by means of the stratified two-stage
sample design, which was also used in the previous
simulation (see Table 2). For one sample, the SSU’s are
randomly divided into four subsamples, each with a size of
945, by means of a CRD. For the other sample the SSU’s
are randomly divided into four subsamples, each with a size
of 945, by means of an RBD where the strata are used as the
block variables. Both experiments are conducted under the
alternative hypothesis where the treatment effects in the
finite population are equal to β = (0, 80, 160, 240)^t. The
design-based linear regression analysis is performed with
Stata’s SVYREG procedure that accounts for the stratification,
two-stage sampling and the unequal selection
probabilities of the sampling design (StataCorp. 2001). The
ANOVA is performed with Stata’s ANOVA procedure
(StataCorp. 2001). The analysis results under a CRD are
summarized for the design-based Wald test in Table 5.1, for
the design-based linear regression approach in Table 5.2,
and for the ANOVA in Table 5.3. Similarly, the analysis
results under an RBD are summarized in Tables 6.1, 6.2,
and 6.3.

As emphasized in section 3, the linear regression 
approach ignores the design variance due to the random- 
ization of the sampling units over the subsamples with 
respect to the experimental design. As a result the standard 
errors of the treatment effects are smaller under the linear 
regression approach than in the case of the design-based 
Wald test, and the design-based regression approach results 
in smaller p-values for the test of treatment effects. 


Table 4.6
Simulation Results RBD, β = (0, 20, 40, 60)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,390  3,090       1-2   -20   6,225   6,180   648            0.050  0.09099  0.09371
2  20   3,410  3,089       1-3   -40   6,177   6,181   648            0.025  0.05096  0.05238
3  40   3,430  3,090       1-4   -60   6,184   6,180   649            0.010  0.02365  0.02405
4  60   3,450  3,090
W̄: 3.66771   δ: 0.3226

Table 4.7
Simulation Results RBD, β = (0, 40, 80, 120)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,389  3,088       1-2   -40   6,178   6,176   647            0.050  0.23999  0.24310
2  40   3,429  3,088       1-3   -80   6,183   6,176   649            0.025  0.15999  0.16302
3  80   3,469  3,088       1-4   -120  6,189   6,176   649            0.010  0.09181  0.09458
4  120  3,509  3,088
W̄: 5.62182   δ: 1.2905

Table 4.8
Simulation Results RBD, β = (0, 80, 160, 240)^t
Subsamples                  Contrasts (diagonal elements)                Wald statistic
k  β_k  Ȳ_k    d̄_k        k-k'  CȲ    CVC^t   CD̄C^t   σ(CD̂C^t)      α      P(W)     P^sim(W)
1  0    3,390  3,091       1-2   -80   6,204   6,180   648            0.050  0.77340  0.77712
2  80   3,470  3,090       1-3   -160  6,210   6,181   648            0.025  0.68135  0.68789
3  160  3,550  3,090       1-4   -240  6,214   6,181   648            0.010  0.55796  0.56701
4  240  3,630  3,090
W̄: 13.48594   δ: [illegible]
Table 5.1
Design-based Wald Statistic, CRD
Subsamples          Contrasts                        Wald statistic
k  β_k  Ŷ_k        k-k'  Ŷ_k - Ŷ_k'  Std. err.      W       df  p-value
1  0    3,414       1-2   -124        164.915        2.4740  3   0.480
2  80   3,538       1-3   -182        162.542
3  160  3,596       1-4   -249        164.782
4  240  3,663

Table 5.2
Design-based Regression, CRD
Source        Coefficient  Std. err.  Wald statistic  df  p-value
treatment                             2.907           3   0.4062
treatment 1   [illegible]  177.60
treatment 2   -58.36       175.56
treatment 4   66.79        170.46
constant      3,596.47     194.75

Table 5.3
Standard ANOVA, CRD
Subsamples            Contrast                 ANOVA
k  β_k  ȳ_k          k-k'  ȳ_k - ȳ_k'         Source              df     MS           F     p-value
1  0    8,021         1-2   -73                Between treatments  3      14,432,816   0.14  0.9376
2  80   8,094         1-3   66                 Residual            3,776  104,924,668
3  160  7,955         1-4   [illegible]        Total               3,779
4  240  [illegible]

Table 6.1
Design-based Wald Statistic, RBD
Subsamples          Contrasts                        Wald statistic
k  β_k  Ŷ_k        k-k'  Ŷ_k - Ŷ_k'  Std. err.      W        df  p-value
1  0    3,395       1-2   [illegible]  81.247        9.93011  3   0.0192
2  80   3,420       1-3   -120         80.697
3  160  3,515       1-4   -231         82.383
4  240  3,626

Table 6.2
Design-based Regression, RBD
Source        Coefficient  Std. err.  Wald statistic  df  p-value
Block
Block 2       -17,068.28   2,556.46
Block 3       -21,999.39   2,540.98
Treatment                             18.4212         3   0.00036
Treatment 1   [illegible]  74.84
Treatment 2   -246.78      60.05
Treatment 3   -97.91       73.39
Constant      23,589.64    2,543.25

Table 6.3
Standard ANOVA, RBD
Subsamples            Contrast                 ANOVA
k  β_k  ȳ_k          k-k'  ȳ_k - ȳ_k'         Source              df     MS           F     p-value
1  0    8,815         1-2   665                Between blocks      2      1.6773E+11
2  80   8,150         1-3   249                Between treatments  3      84,377,227   1.99  0.1126
3  160  [illegible]   1-4   69                 Residual            3,774  42,310,035
4  240  [illegible]                            Total               3,779  131,089,505


The standard ANOVA is a naive approach, since it 
ignores the stratification, clustering and selection of 
sampling units using inclusion probabilities that are chosen 
proportional to the value of the target parameter. The net 
result of ignoring these aspects of the sampling design in the 
analysis is a severe over-estimation of the subsample 
estimates as well as the standard errors. Compared to the 
other two design-based procedures, this results in larger 
p-values for the test of treatment effects. 

Another important advantage of the design-based Wald
test compared to the design-based linear regression approach
is that the Wald test always concerns the differences
between the subsample estimates, which facilitates the
interpretation of the results. This property is particularly
important for embedded experiments aimed at the quantification
of trend disruptions in the parameters of a survey
due to adjustments in the survey design. In the case of a
CRD, the linear regression model consists of one intercept
parameter and three coefficients for the treatment effects. In
this particularly simple situation, the coefficients for the
treatment effects are exactly equal to the differences
between the subsample estimates. This property, however,
does not hold for the treatment effects obtained under more
complicated models, as for example in the case of the RBD.
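
The CRD property can be checked against the reported results. The comparison below assumes that treatment 3 is the reference category of the regression in Table 5.2, which is inferred from the fact that the constant matches the third subsample estimate in Table 5.1. The small discrepancies are consistent with the rounding of the subsample estimates to integers.

```latex
\begin{align*}
\hat{Y}_3 &= 3{,}596 && \text{versus constant } 3{,}596.47,\\
\hat{Y}_2 - \hat{Y}_3 &= 3{,}538 - 3{,}596 = -58 && \text{versus coefficient } -58.36,\\
\hat{Y}_4 - \hat{Y}_3 &= 3{,}663 - 3{,}596 = 67 && \text{versus coefficient } 66.79.
\end{align*}
```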


5. Discussion and Conclusions 


In this paper we discuss how the statistical methodology 
of randomized experiments and random survey sampling 
can support the design and analysis of experiments 
embedded in ongoing sample surveys. The sample survey 
design forms a prior framework for the application of 
principles, known from the theory of experimental designs, 
like randomization and local control by means of blocking 
on strata, PSU’s, clusters or interviewers. To test hypotheses 
about the estimates of finite population parameters 
observed under different treatments of the experiment, a 
design-based Wald statistic for the analysis of CRD’s and 
RBD’s embedded in general complex sampling designs is 
derived using the Horvitz-Thompson estimator and the 
generalized regression estimator. The application of 
randomized sampling from a finite population in combina- 
tion with this design-based analysis procedure enables us to 
generalize the results of the experiment observed in the 
specific sample to the entire survey population. 

Since we allow for general complex sampling designs, a 
rather complicated expression for the covariance matrix of 
the treatment effects with nonzero off-diagonal entries is 
expected. The derived estimator for this covariance matrix, 
however, has a structure as if the sampling units were drawn 
with replacement and with unequal selection probabilities. 
No second order inclusion probabilities or design-cova- 
riances between the treatment effects are required, which 
simplifies the analysis considerably. For example, in the 
case of simple random sampling without replacement this 
result entails that the finite population correction factor 
should be disregarded in estimating the variance of 
contrasts. As a result a Wald statistic, derived from a design- 
based perspective under general complex sampling designs, 
is obtained that still has the appealing relatively simple 
structure of standard model-based analysis procedures. 

For CRD’s and RBD’s embedded in a self-weighted
sampling design analyzed with the extended Horvitz-Thompson
estimator and a pooled variance estimator, the
Wald statistic coincides with the F-statistic of an
ANOVA for the one-way and two-way layouts. For the
analysis of the embedded two-treatment experiment, a
design-based version of the t-statistic can be derived as a
special case of the Wald statistic. Expressions and more
details about this design-based t-statistic and its
relationship with Welch’s t-statistic and the standard
t-statistic can be found in Van den Brakel and Renssen
(1998), Van den Brakel (2001) or Van den Brakel and Van
Berkel (2002).

The analysis procedure proposed in this paper is 
implemented in a software package, called X-tool. This tool 
will become available as a component of the Blaise survey 
processing software package, developed by Statistics 
Netherlands. 


Appendix 
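Properties of the randomization vectors $p_{ik}$


For CRD’s and RBD’s the randomization vectors $p_{ik}$ are
defined by (14) and (15). As a consequence of the randomization
mechanism of the experimental design, the vectors
$p_{ik}$ are random with the following conditional probability
mass functions. For a CRD we have

$$P\left(p_{ik} = \frac{n_+}{n_k} r_k \,\Big|\, s\right) = \frac{n_k}{n_+} \quad \text{and} \quad P(p_{ik} = 0 \mid s) = 1 - \frac{n_k}{n_+}.$$

For an RBD we have

$$P\left(p_{ik} = \frac{n_{j+}}{n_{jk}} r_k \,\Big|\, s_j\right) = \frac{n_{jk}}{n_{j+}} \quad \text{and} \quad P(p_{ik} = 0 \mid s_j) = 1 - \frac{n_{jk}}{n_{j+}}.$$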


Properties of these vectors are derived for an RBD.
Properties for a CRD follow as a special case, since a CRD can
be considered as an RBD with one block. Let w. pr. denote
“with probability”.


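For $i \neq i'$ and $k \neq k'$ it follows that

$$p_{ik} p_{ik}^t = \begin{cases} \dfrac{n_{j+}^2}{n_{jk}^2} r_k r_k^t & \text{w. pr. } \dfrac{n_{jk}}{n_{j+}}, \\ O & \text{w. pr. } 1 - \dfrac{n_{jk}}{n_{j+}}, \end{cases} \quad \text{if } i \in s_j,$$

$$p_{ik} p_{ik'}^t = \begin{cases} \dfrac{n_{j+}^2}{n_{jk} n_{jk'}} r_k r_{k'}^t & \text{w. pr. } 0, \\ O & \text{w. pr. } 1, \end{cases} \quad \text{if } i \in s_j,$$

$$p_{ik} p_{i'k'}^t = \begin{cases} \dfrac{n_{j+}^2}{n_{jk} n_{jk'}} r_k r_{k'}^t & \text{w. pr. } \dfrac{n_{jk} n_{jk'}}{n_{j+}(n_{j+}-1)}, \\ O & \text{w. pr. } 1 - \dfrac{n_{jk} n_{jk'}}{n_{j+}(n_{j+}-1)}, \end{cases} \quad \text{if } i \in s_j,\ i' \in s_j,$$

$$p_{ik} p_{i'k'}^t = \begin{cases} \dfrac{n_{j+} n_{j'+}}{n_{jk} n_{j'k'}} r_k r_{k'}^t & \text{w. pr. } \dfrac{n_{jk} n_{j'k'}}{n_{j+} n_{j'+}}, \\ O & \text{w. pr. } 1 - \dfrac{n_{jk} n_{j'k'}}{n_{j+} n_{j'+}}, \end{cases} \quad \text{if } i \in s_j,\ i' \in s_{j'},$$

$$p_{ik} p_{i'k}^t = \begin{cases} \dfrac{n_{j+}^2}{n_{jk}^2} r_k r_k^t & \text{w. pr. } \dfrac{n_{jk}(n_{jk}-1)}{n_{j+}(n_{j+}-1)}, \\ O & \text{w. pr. } 1 - \dfrac{n_{jk}(n_{jk}-1)}{n_{j+}(n_{j+}-1)}, \end{cases} \quad \text{if } i \in s_j,\ i' \in s_j,$$

$$p_{ik} p_{i'k}^t = \begin{cases} \dfrac{n_{j+} n_{j'+}}{n_{jk} n_{j'k}} r_k r_k^t & \text{w. pr. } \dfrac{n_{jk} n_{j'k}}{n_{j+} n_{j'+}}, \\ O & \text{w. pr. } 1 - \dfrac{n_{jk} n_{j'k}}{n_{j+} n_{j'+}}, \end{cases} \quad \text{if } i \in s_j,\ i' \in s_{j'}.$$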

The expectation of $p_{ik}$ with respect to the experimental
design is given by:

$$E_e(p_{ik}) = P\left(p_{ik} = \frac{n_{j+}}{n_{jk}} r_k\right) \frac{n_{j+}}{n_{jk}} r_k + P(p_{ik} = 0)\, 0 = r_k. \qquad (42)$$


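The following covariances with respect to the experimental
design can be derived:

$$\mathrm{Cov}_e(p_{ik}, p_{ik}) = \frac{(n_{j+} - n_{jk})}{n_{jk}} r_k r_k^t, \qquad (43)$$

$$\mathrm{Cov}_e(p_{ik}, p_{ik'}) = -r_k r_{k'}^t, \qquad (44)$$

$$\mathrm{Cov}_e(p_{ik}, p_{i'k'}) = \begin{cases} \dfrac{1}{(n_{j+}-1)} r_k r_{k'}^t & \text{if } i \in s_j \text{ and } i' \in s_j, \\ O & \text{if } i \in s_j \text{ and } i' \in s_{j'}, \end{cases} \qquad (45)$$

$$\mathrm{Cov}_e(p_{ik}, p_{i'k}) = \begin{cases} -\dfrac{n_{j+} - n_{jk}}{n_{jk}(n_{j+}-1)} r_k r_k^t & \text{if } i \in s_j \text{ and } i' \in s_j, \\ O & \text{if } i \in s_j \text{ and } i' \in s_{j'}. \end{cases} \qquad (46)$$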
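
The covariances (43) through (46) can be verified by brute force for a small block. The sketch below enumerates every equally likely assignment within one block and computes exact moments of the scalar factor that multiplies $r_k$ in $p_{ik}$. The block of four units with two treatments of size two, and all function names, are hypothetical choices made for the illustration.

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(block_sizes):
    """Enumerate all assignments in one block; return a covariance function."""
    labels = [k for k, n_k in enumerate(block_sizes) for _ in range(n_k)]
    n = len(labels)
    assignments = sorted(set(permutations(labels)))  # equally likely outcomes

    def coefficient(assignment, i, k):
        # Scalar factor in front of r_k: n_{j+}/n_{jk} if unit i gets treatment k.
        return Fraction(n, block_sizes[k]) if assignment[i] == k else Fraction(0)

    def mean(f):
        return sum(f(a) for a in assignments) / len(assignments)

    def cov(i, k, i2, k2):
        e1 = mean(lambda a: coefficient(a, i, k))
        e2 = mean(lambda a: coefficient(a, i2, k2))
        e12 = mean(lambda a: coefficient(a, i, k) * coefficient(a, i2, k2))
        return e12 - e1 * e2

    return cov


cov = exact_moments([2, 2])  # n_{j+} = 4 units, two treatments of size 2
print(cov(0, 0, 0, 0), cov(0, 0, 0, 1), cov(0, 0, 1, 1), cov(0, 0, 1, 0))
```

For this block the four printed values correspond, in order, to the scalar factors in (43), (44), (45) and (46) evaluated at $n_{j+} = 4$ and $n_{jk} = 2$.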
Proof of formula (23)


Under the stated condition that a constant H-vector a
exists such that $a^t x_i = 1$ for all $i \in U$, and conditional on
the realization of $u_i$, i = 1, ..., N, according to super-population
model (16), it follows that $b_k$ in (18) can be
evaluated as

$$b_k = \dots = b + d + a\beta_k, \qquad (47)$$

where b denotes the regression coefficients defined by (17)
and d denotes the regression coefficients from the
regression function of the interviewer effects on the
auxiliary variables $x_i$. From result (47) it follows that
$B^t x_i = j(b^t x_i + d^t x_i) + \beta$. Since Cj = 0 and from
measurement error model (1) and linear regression model
(16) it follows that

$$C(y_i - B^t x_i) = C(j u_i + j\gamma_q + \beta + \varepsilon_i - j(b^t + d^t) x_i - \beta) = C\varepsilon_i,$$

QED.

Proof of formula (26) for an RBD 


First an expression for Cov, (CE, |m,s) is derived. Let 
elements 
= Vea Dy x). Consequently, e, = y,; — B’x,. Note that 
E mE sCOV CBes ve S)= CEE, Coy BE alae s)C’ with 
=e, jail e 


denote a K-—vector with 


Fein s oe Oba 


Ex. 77). Futhermore note that 


ra 3 [ Pi (Yi —B’x;) |_S Pie; 

Ep ae ty 48 
eur = 3 | 7 N 2 aN (48) 
Using (43) and (46), the diagonal 


Cov, (Ey; |m, 5) can be elaborated as 


elements of 


Var, (Ex.yr |, 5) 


n. t n t 
++ ‘ e. ++ * e. 

= {Ove Petr i.) Pic ©? | ms 
i=l 1,N i=l TN 


Nis 


e e 
== Cov. (p: .D; | 11,5) 
> 2 ,N (Dix Pix | esr 
i=] 
+y s Cov AD ap ealens On 


i=) sigh iN 
n n 2 
j+ j+ 
Wig Mie Sy | se Cik 
(n;,—l) ne pa UN on a NN 


2 
a See e ©ik abla ©ik 
(nj, —1) i=l 1,N Nj. i=l 1;,N 
Using (44) and (45), the off-diagonal elements of 
Cov, (Ey; |, 5) can be elaborated as 


(49) 


Cov, (Exar Eur 1,8) 


n t n t 
4 ae 4 wp Cx 
= Cov, Pix i . Pix i | m,s 
i=l 1,N i=l TN 
Nin. t 
e. e. 
i t i 
Og REN Cov, (Dix Pix’ | mM, Ora N 


J 
a i=] i i 
eS Si “Cov o (Dig Pix’ | m, — 


i=l i‘#i=l iN 
Nix 
1%q) 
= Hie ie
jal (n jt ier jr ick
The results (49) and (50) can be written in matrix notation:


Cov aE: | m, s) 


- fe $ (met 


ja Nis =I 


$ Ye= Bx, aa 


Dae el 


t 
Yi, —B,x; LS y,; —Bx, 
NT; Nis =1 Nt, 


where D denotes a K x K diagonal matrix with elements 


According to (23) it follows that 


Cov, (CEq, |, 5) 


swe Ny Gl Ce, 1 4Ce 
ScvC—). mae — 
ja My -lia \ NT ny, a NT, 


Nt, =, NT, 


The final part of the proof is to take the expectation of
$\mathrm{Cov}_e(C\hat{E} \mid m, s)$ with respect to the sampling design and
the measurement error model. The proof is given for RBD’s
where PSU’s are block variables. In a two-stage sampling
scheme, J blocks or PSU’s are drawn from a finite population
of $J_U$ blocks with first order inclusion probabilities
$\pi_j^{I}$. Within each PSU, $n_{j+}$ SSU’s are drawn in the second
stage with first and second order inclusion probabilities $\pi_{i|j}$
and $\pi_{ii'|j}$. The first order inclusion probabilities of the
individuals in the sample are $\pi_i = \pi_j^{I} \pi_{i|j}$. Furthermore, let


Dl 


s= zi. 


es 


denote the population mean of the measurement errors of 
the individuals of block j. Then 


i+ &; 
J 
iN; iy 


denotes the Horvitz-Thompson estimator for A j- Now we 
have 
jal i=l Nt, Ny i=] Nt, Ny i=l Wn
2 
aS LY ; 
Seesiilered 10B i 
t 
chs 2s -3,| ee -%, 
nia Nm, NGM” (52) 
1 Ev.NS aes pre = hh 
~—{A, -4,}(, -4,) 


Let E, denote the expectation with respect to the first 
stage of ‘the sampling design and E, the expectation 
with respect to the second stage of the sampling design. 
Taking the expectation with respect to the measurement 
error model and the sampling design of the first part of 
(52) and using model assumption (3) leads to 


t 
2 AS An 
EEE, pall ae —— . tesa 
ve Te; Ni, ii \N; ae N; Mi, 


| La APOE Soe a8 ont 
EES se {5 i > ny Ai = 
j) Mia Ga N5Ty, 


ja (OU 


2 
25 i. (53) 
Tit 


Ae ote 
_ nN? 2 
Note that E, (A, -A,)(A, —A,)' in (52) equals the design 
variance of A ; With respect to the second stage of the 
sampling design in block j. Taking the expectation with 
respect to the measurement error model and the sampling 
design of the second part of (52) and using model 
assumption (3) leads to 


2 
AS — AS ane 
Sipe Bes Bae : a, -Z,\h, -i,] 
“jal; ) Mig 
Ne ae i hai Te ££; 
——E (%., — 1 TC. ,) —+— 
1) De ii|j al feet / eT eee TT 
1; Nj ict i\ ‘lpi 
a ae ea 
oe SY ifs, (54) 
oe. I i 


With results (52), (53) and (54) we can elaborate the second 
term on the right hand side of the equal sign of (51) as 


coe wy cama CGA yA lie AEs 
Qa el eee 
m >» ove ie a oe | 
Nix F C Cc 
Cy 1s Coy) _ Teg GanGy (55) 
NT; Ni at NT; ON i=l Tt; 
Finally, it follows from (51) and (55) that
E,E Cov,(CE,, |m,s) = 


N t 
E,,E,CDC’ aos CxiC 
NO a 1; 


1 


QED. 


The derivation for an RBD where strata are block 
variables follows directly as a special case from an RBD 
where PSU’s are block variables with Tt =1, Ti, =T;, 
Ti j;=%y and J=J,.The proof for an RBD where 
clusters are block variables follows directly as a special case 
from an RBD where PSU’s are block variables with Tit; = 
and Tin jul. i 

The expectation of Cov,(CE,,,|m,s) with respect to 
the sampling design and the measurement error model for 
an RBD where interviewers are the block variables does not 
follow as a special case from an RBD where PSU’s are 
block variables. Since the block variables are not directly 
linked with the sampling design, the blocks should be 
considered as domains where the block size n,, is random 
with respect to the sampling design. The derivation follows
the same steps as in the proof for blocking on PSU’s and is
given by Van den Brakel (2001).


Proof of formula (32) 
Matrix D can be partitioned as follows: 


According to Bartlett’s identity (Morisson 1990, chapter 2) 
it follows that: 

A A A A | A A 
(CDC’)* =(d,jj +D.)* = DF ———— DF jj De. 

=a ) wace(D=) JJ 

From this result it follows that 
C'(CDC')'C = C'D'C -—_.— C'D jj DiC 
trace(D™ ) 
eh 
trace(D‘) 


Inserting (56) into (31) leads to (32), QED. 


=p'- D7 jj'D". (56)
Domain Estimators for the Item Count Technique


Takahiro Tsuchiya ' 


Abstract 


The item count technique, which is an indirect questioning technique, was devised to estimate the proportion of people for
whom a sensitive key item holds true. This is achieved by having respondents report the number of descriptive phrases,
from a list of several phrases, that they believe apply to themselves. The list for half the sample includes the key item, and
the list for the other half does not include the key item. The difference in mean number of selected phrases is an estimator of 
the proportion. In this article, we propose two new methods, referred to as the cross-based method and the double cross- 
based method, by which proportions in subgroups or domains are estimated based on the data obtained via the item count 
technique. In order to assess the precision of the proposed methods, we conducted simulation experiments using data 
obtained from a survey of the Japanese national character. The results illustrate that the double cross-based method is much 
more accurate than the traditional stratified method, and is less likely to produce illogical estimates. 


Key Words: Indirect questioning techniques; Item count technique; Domain estimators; Survey of Japanese national character.


1. Introduction 


1.1 Indirect Questioning Techniques 


Suppose that a population $U$ is divided into two subpopulations $U_{(T)}$ and $U_{(\bar{T})}$, where $U_{(T)}$ is the set of elements
having an attribute $T$, and $U_{(\bar{T})}$ is the complement of $U_{(T)}$.
One purpose of social surveys is to estimate $\pi = \bar{Y} = P(Y = 1)$, where

\[
y_k = \begin{cases} 1 & \text{if } k \in U_{(T)} \\ 0 & \text{otherwise} \end{cases}
\]

and $P(\cdot)$ denotes the proportion of units having a particular
value of the variable. For example, when $T$ is “supporting
the present cabinet,” $\pi$ indicates the cabinet support rate, and
when $T$ is “using a certain illegal drug,” $\pi$ denotes the
prevalence rate of drug use.

In a direct questioning technique, researchers ask
respondents “Do you belong to $U_{(T)}$?,” and directly obtain
the indicator value $y_k$ as “yes” or “no” (Cochran 1977, page
50). When every respondent has an equal inclusion probability,
the sample mean $\bar{y}$ serves as one estimator of $\pi$.

On the other hand, some indirect questioning techniques,
including the randomized response technique (Warner
1965), the nominative technique (Miller 1985), the item
count technique (Droitcour, Caspar, Hubbard, Parsley,
Visscher and Ezzati 1991), and the three-card technique
(Droitcour, Larson and Scheuren 2001), were devised because
some respondents tend to evade sensitive questions, such as
those concerning highly private matters, socially unaccepted
or deviant behaviors or illegal acts. The essential feature of
indirect techniques is that instead of a direct observation of
$Y$, another variable $X = g(Y, V)$, which is some sort of
function of $Y$ and, if necessary, of other random variables $V$,
is observed so that respondents feel that their true $Y$-values
are not revealed. While this feature is expected to elicit
truthful answers from evasive respondents, both the
questioning and the estimation procedures are rather
complicated compared to the direct questioning technique,
partly because the function $g(\cdot)$ sometimes includes
randomization processes. We shall outline two indirect
techniques below.

The randomized response is the most popular among the 
indirect techniques, and various modifications have been 
proposed (Abul-Ela, Greenberg and Horvitz 1967; Warner 
1971; Chaudhuri and Mukerjee 1988; Greenberg, Abul-Ela, 
Simmons and Horvitz 1969; Takahasi and Sakasegawa 
1977). Although the randomized response is not the topic of 
this article, we shall briefly outline Warner’s original 
procedure here for reference, because this technique will be 
simulated in a later section. 


1. Prepare two types of questionnaires. In questionnaire A,
   respondents are asked “Do you belong to $U_{(T)}$?,” and in
   questionnaire B, respondents are asked “Do you belong to
   $U_{(\bar{T})}$?”

2. Let $p\ (\neq 0.5)$ be the predetermined probability.
   Each respondent selects questionnaire A or B with
   probabilities $p$ or $1 - p$ respectively, but no one
   other than the respondent knows which
   questionnaire is selected.

Suppose $X$ is an indicator variable whose value is 1 if
the response is “yes” or 0 if the response is “no.” The
estimator of $\pi$ is given by

\[
\hat{\pi} = \frac{\bar{X} + p - 1}{2p - 1}, \tag{1}
\]

where $\bar{X}$ is the sample mean of $X$.


Since the researchers have no information regarding the 
type of questionnaire selected by each respondent, more 
respondents are expected to give truthful answers than they 
would if asked direct questions. 
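
As an illustration of Warner's procedure, the following Python sketch simulates the private choice between questionnaires A and B and then applies estimator (1) in its standard form, $\hat{\pi} = (\bar{X} + p - 1)/(2p - 1)$. The function names, the value $p = 0.7$ and the artificial population are assumptions made for this example only; they do not come from the article.

```python
import random


def warner_estimate(responses, p):
    """Warner's estimator (1); responses are 1 ("yes") or 0 ("no")."""
    if p == 0.5:
        raise ValueError("p must differ from 0.5")
    x_bar = sum(responses) / len(responses)
    return (x_bar + p - 1) / (2 * p - 1)


def simulate_warner(y_values, p, rng):
    """Each respondent privately answers questionnaire A with probability p, else B."""
    responses = []
    for y in y_values:
        if rng.random() < p:
            responses.append(y)       # A: "Do you belong to U_(T)?"
        else:
            responses.append(1 - y)   # B: "Do you belong to the complement?"
    return responses


rng = random.Random(0)
population = [1] * 300 + [0] * 700    # hypothetical data with true pi = 0.3
responses = simulate_warner(population, p=0.7, rng=rng)
pi_hat = warner_estimate(responses, p=0.7)
print(round(pi_hat, 3))
```

Because $2p - 1$ appears in the denominator, the variance of the estimator grows as $p$ approaches 0.5, which is one reason why $p = 0.5$ is excluded.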


The item count technique, which is the subject of this
article, is not as popular despite its simplicity. The technique
is also effective when posing sensitive questions, because
respondents are asked not to answer sensitive questions
directly but merely to report the number of items that hold
true for them. The following are the processes of the item
count technique:

1. Prepare the key item $T$, which is the primary focus
   of the study, and $G$ other non-key items $E_1, \ldots, E_G$.
   For example, $T$ is “using a certain illegal drug” as
   mentioned above, and $E_g$ is some sort of non-sensitive
   description such as “owning a bicycle.”

2. Prepare two types of questionnaires, A and B. In
   questionnaire A, respondents are asked to answer
   the number $C^A$ of items that are true with respect
   to themselves among $G$ non-key items. In
   questionnaire B, respondents are asked to answer
   the number $C^B$ of items that are true with respect
   to themselves out of $G + 1$ items, including the key
   item $T$.

   Table 1 lists examples of item lists. Our aim is to
   estimate the proportion of people who use a certain
   illegal drug. The key item is “using a certain illegal
   drug” in questionnaire B and the other four items
   are non-key items. Except when a response to
   questionnaire B is $C^B = 0$ or $C^B = 5$, researchers
   cannot detect which items hold true for the
   respondent. For example, a respondent may reply that
   four items in questionnaire B are true, but we
   cannot be sure that the respondent uses the drug at all.
   Hence, it is expected that more respondents using an
   illegal drug will report truthful answers in such a
   scenario than when asked a direct question.

3. Divide the total sample randomly into two subgroups, A and B,
   of size $n^A$ and $n^B$, so that each questionnaire
   is assigned to the corresponding subgroup.

Table 1
Examples of Item Lists

Questionnaire A                      | Questionnaire B
How many of the following hold true  | How many of the following hold true
for you?                             | for you?
- owning a bicycle                   | - owning a bicycle
- having travelled abroad            | - having travelled abroad
- having called an ambulance         | - having called an ambulance
- owning a summer villa              | - using a certain illegal drug
                                     | - owning a summer villa


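4. The estimator of $\pi$ is given by

\[
\hat{\pi} = \hat{\bar{C}}^B - \hat{\bar{C}}^A, \tag{2}
\]

where $\hat{\bar{C}}^A$ and $\hat{\bar{C}}^B$ are the estimated means of $C^A$
and $C^B$ respectively. The justification of (2) is
explained in section 2.1. When every unit in the
sample has an equal inclusion probability, $\hat{\pi}$ can be
written as

\[
\hat{\pi} = \sum_{c=0}^{G+1} c\,\frac{n_c^B}{n^B} - \sum_{c=0}^{G} c\,\frac{n_c^A}{n^A}, \tag{3}
\]

where $n_c^A$ and $n_c^B$ are the numbers of respondents
whose answers are $C^A = c$ and $C^B = c$, respectively.
Moreover, when an auxiliary variable $Z$ is available
and its distribution $P(Z = z) = m_z$ in the population
is known, for example from a census, poststratification
is often used to adjust the sample distribution
of $Z$ to the population. That is, the poststratified
estimator of $\pi$ is given by

\begin{align}
\hat{\pi}_{PS} &= \sum_{c=0}^{G+1} c\,\frac{\hat{n}_c^B}{n^B} - \sum_{c=0}^{G} c\,\frac{\hat{n}_c^A}{n^A} \notag \\
&= \sum_{c=0}^{G+1} c \sum_z \frac{m_z}{n_{\cdot z}^B}\, n_{cz}^B - \sum_{c=0}^{G} c \sum_z \frac{m_z}{n_{\cdot z}^A}\, n_{cz}^A, \tag{4}
\end{align}

where $n_{cz}^A$ is the number of respondents for each
$C^A = c$ and $Z = z$,

\[
n_{\cdot z}^A = \sum_{c=0}^{G} n_{cz}^A, \qquad
\hat{n}_c^A = \sum_z n_{cz}^A\, \hat{v}_z^A, \qquad
\hat{v}_z^A = \frac{m_z\, n^A}{n_{\cdot z}^A},
\]

and $n_{cz}^B$, $n_{\cdot z}^B$, $\hat{n}_c^B$, and $\hat{v}_z^B$ are defined in analogous ways.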

One practical merit of the item count technique is that it 
does not demand any randomization devices, which are 
required for the randomized response technique. It is not the 
respondent but a researcher who selects the questionnaire to 
be answered. Hence, the item count technique is easily 
implemented via any self-administered or telephone 
surveys. A more elaborate comparison between the 
randomized response and the item count technique is found 
in Hubbard, Casper and Lessler (1989). 
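
The estimators (3) and (4) reduce to simple arithmetic on the reported counts. The following Python sketch shows one way to compute them, assuming that each response is stored as a `(count, z)` pair; the function names, the data and the population shares `m` are hypothetical and serve only to illustrate the formulas.

```python
from collections import Counter


def item_count_estimate(counts_a, counts_b):
    """Estimator (3): mean reported count in B minus mean reported count in A."""
    return sum(counts_b) / len(counts_b) - sum(counts_a) / len(counts_a)


def poststratified_estimate(data_a, data_b, m):
    """Estimator (4). data_a and data_b hold (count, z) pairs; m maps z to m_z."""
    def weighted_mean(data):
        n_z = Counter(z for _, z in data)      # respondents per domain in the subgroup
        return sum(c * m[z] / n_z[z] for c, z in data)
    return weighted_mean(data_b) - weighted_mean(data_a)


# Hypothetical responses: (reported count, domain)
data_a = [(1, "m"), (2, "m"), (0, "f"), (1, "f")]
data_b = [(2, "m"), (2, "m"), (1, "f"), (2, "f")]

pi_hat = item_count_estimate([c for c, _ in data_a], [c for c, _ in data_b])
pi_ps = poststratified_estimate(data_a, data_b, {"m": 0.4, "f": 0.6})
print(pi_hat, pi_ps)
```

When the population shares equal the sample shares in both subgroups (here 0.5 each), the two estimates coincide; with other shares, the poststratified estimate reweights the domain means.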


Questionnaire A is introduced to obtain the distribution
of the number of non-key items. That is, respondents
to the questionnaire A do not answer the sensitive question. 
Therefore, it is possible to increase the precision of the 
estimator using the double-list version of item count 
(Droitcour et al. 1991), which exchanges the roles between 
the two subgroups. However, we limit our argument in this 
article to a single-list version, because the extension of 
estimators to the double-list version is straightforward. 


1.2 Purpose of this Article 


Thus far, we have focused on the parameter
$\pi = \bar{Y} = P(Y = 1)$ of a total population. However,
estimators in subpopulations or domains (Särndal, Swensson and
Wretman 1992, page 5) are often required, i.e., either a
conditional proportion $P(Y = 1 \mid Z = z)$ or a joint proportion
$P(Y = 1, Z = z)$ must be estimated, where a population is
divided into several domains by the $Z$-value. We refer to
the variable $Z$ as the domain variable in this article. The
domain variables often used are demographic characteristics
such as gender or age. For example, government agencies
would like to know the proportion of people who use a
certain illegal drug in each age group. Even though the
poststratified estimator $\hat{\pi}_{PS}$ in (4) uses the domain variable $Z$, its
aim is an estimation of $P(Y = 1)$ in the entire population.
Our aim in this article is to obtain separate estimates of
$P(Y = 1 \mid Z = z)$ within each domain.

One simple estimation method is as follows: 


1.  Post-stratify the sample into strata or domains 
based on the Z-value. 


2. In each stratum or domain, separately determine 
p(Y =1|Z =z) using (1) or (2), where p(-) is a 
sample estimate of P(-). 


3. If necessary, estimate p(Y =1,Z=z) by multiplying
by a known domain proportion, P(Z =z), or
an estimated domain proportion, p(Z = z).


The above method is referred to throughout this article as a 
stratified method because estimates are obtained separately 
in each stratum or domain. Although Rao (2003) refers to 
the above method as a direct estimate, we have avoided the 
use of the term “direct” in order to avoid confusion with the 
term “direct questioning technique.” 

An advantage of the stratified method is that this method 
is applicable to any indirect questioning technique, 
including the randomized response and item count 
techniques. The U.S. General Accounting Office (1999) 
adopts the stratified method to estimate domains under the 
three-card technique. However, one of the serious problems 
of the stratified method is that it often produces illogical 
estimates, especially negative estimates, in the case of the 
randomized response and the item count, as explained later
in this article. This is mainly because the reduction of the
sample size in each stratum increases the standard errors of 
the estimators (Lessler and O’Reilly 1997). For example, 
Droitcour et al. (1991, page 206) “calculated estimates 
separately for the three risk strata” and obtained negative 
prevalence rate estimates of drug use. 

In the case of the randomized response, there is little 
possibility that domain estimators other than the stratified 
method are developed because information concerning the 
type of questionnaire selected by individual respondents is 
unavailable. In contrast, in the item count technique, the 
questionnaire answered by each respondent is known. 
Therefore, the precision of the domain estimators is 
expected to increase when auxiliary information is used, 
specifically contingency tables between $Z$ and $C^A$ or $C^B$.

In this article, we propose new domain estimators for the 
item count technique, which are referred to as the cross- 
based method and the double cross-based method. In 
addition, we will illustrate the fact that the new estimators 
are more efficient than the traditional stratified method by 
simulating the item count technique using data obtained 
from the survey of the Japanese national character 
concerning the significant attributes of the Japanese 
character. 


2. Domain Estimators for the Item 
Count Technique 


2.1 Stratified Method 


Here, we reformulate the stratified method. Let us 
assume that the following equations hold true for each value 
of c and z. 


Assumption 1.

\begin{align*}
P(C^B = c \mid Z = z) &= P(C^A = c, Y = 0 \mid Z = z) \\
&\quad + P(C^A = c - 1, Y = 1 \mid Z = z), \\
P(C^A = G + 1, Y = 0 \mid Z = z) &= 0.
\end{align*}

These assumptions imply that the difference in the
distribution between $C^A$ and $C^B$ depends solely on $Y$.
Question effects, including order effects and context effects
(Schuman and Presser 1981), are not considered.


We have the following result based on these
assumptions.

Stratified Method.

\begin{align}
P(Y = 1 \mid Z = z) &= \sum_{c=0}^{G+1} c\, P(C^B = c \mid Z = z) - \sum_{c=0}^{G} c\, P(C^A = c \mid Z = z) \tag{5} \\
&= \bar{C}_z^B - \bar{C}_z^A, \tag{6}
\end{align}

where $\bar{C}_z^A$ and $\bar{C}_z^B$ are the domain means of $C^A$ and $C^B$.


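Derivation.

\begin{align*}
\sum_{c=0}^{G+1} c\, P(C^B = c \mid Z = z)
&= \sum_{c=0}^{G+1} c\, P(C^A = c, Y = 0 \mid Z = z) + \sum_{c=0}^{G+1} c\, P(C^A = c - 1, Y = 1 \mid Z = z) \\
&= \sum_{c=0}^{G} c\, P(C^A = c, Y = 0 \mid Z = z) + \sum_{c=0}^{G} (c + 1)\, P(C^A = c, Y = 1 \mid Z = z) \\
&= \sum_{c=0}^{G} c\, \{ P(C^A = c, Y = 0 \mid Z = z) + P(C^A = c, Y = 1 \mid Z = z) \} \\
&\quad + \sum_{c=0}^{G} P(C^A = c, Y = 1 \mid Z = z) \\
&= \sum_{c=0}^{G} c\, P(C^A = c \mid Z = z) + P(Y = 1 \mid Z = z).
\end{align*}

Transposing the first term to the left-hand side yields the
stratified method (5).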

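The estimator $p(Y = 1 \mid Z = z)$ is obtained by substituting
domain means $\bar{C}_z^A$ and $\bar{C}_z^B$ with their estimators, $\hat{\bar{C}}_z^A$ and
$\hat{\bar{C}}_z^B$:

\[
p(Y = 1 \mid Z = z) = \hat{\bar{C}}_z^B - \hat{\bar{C}}_z^A. \tag{7}
\]

When the inclusion probabilities are equal for all units in the
sample, the estimator of $P(Y = 1 \mid Z = z)$ is written as

\[
p(Y = 1 \mid Z = z) = \sum_{c=0}^{G+1} c\,\frac{n_{cz}^B}{n_{\cdot z}^B} - \sum_{c=0}^{G} c\,\frac{n_{cz}^A}{n_{\cdot z}^A}, \tag{8}
\]

where $n_{cz}^A$, $n_{cz}^B$, $n_{\cdot z}^A$ and $n_{\cdot z}^B$ are defined in section 1.1.

The equations (2) and (3) for the entire population are
special cases of (7) and (8).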

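One merit of the stratified method is that the variance
estimator of $p(Y = 1 \mid Z = z)$ is easily obtained by

\[
\widehat{\mathrm{Var}}\big(p(Y = 1 \mid Z = z)\big) = \widehat{\mathrm{Var}}\big(\hat{\bar{C}}_z^B\big) + \widehat{\mathrm{Var}}\big(\hat{\bar{C}}_z^A\big). \tag{9}
\]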
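
The stratified estimator (8) and its variance estimator (9) can be sketched in a few lines of Python. This is a minimal illustration, assuming that responses are stored as `(count, z)` pairs and that the variance of each domain mean is estimated by the sample variance divided by the domain sample size (that is, ignoring any finite population correction); the function name and the data are hypothetical.

```python
from statistics import mean, variance


def stratified_estimate(data_a, data_b, z):
    """Return (estimate, variance estimate) for P(Y = 1 | Z = z)."""
    c_a = [c for c, zz in data_a if zz == z]   # counts C^A in domain z
    c_b = [c for c, zz in data_b if zz == z]   # counts C^B in domain z
    estimate = mean(c_b) - mean(c_a)                               # (7), (8)
    var_hat = variance(c_b) / len(c_b) + variance(c_a) / len(c_a)  # (9)
    return estimate, var_hat


# Hypothetical responses: (reported count, domain)
data_a = [(1, "m"), (2, "m"), (2, "f"), (1, "f")]
data_b = [(2, "m"), (2, "m"), (1, "f"), (1, "f")]

est_m, var_m = stratified_estimate(data_a, data_b, "m")
est_f, var_f = stratified_estimate(data_a, data_b, "f")
print(est_m, var_m, est_f, var_f)
```

In this artificial example the estimate for domain "f" is negative, which illustrates how small domain samples can lead the stratified method to produce illogical estimates.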


On the other hand, as noted in the previous section, the
reduction of sample size in each stratum increases estimated
variances in (9). Further, the marginal estimator $p(Y = 1)$
obtained by using (8) does not correspond to that obtained
directly by (3), unless $n_{\cdot z}^A / n^A = n_{\cdot z}^B / n^B$ for all $z$. That is, when
$P(Z = z)$ is not known, its estimator is given by

\[
p(Z = z) = \frac{n_{\cdot z}^A + n_{\cdot z}^B}{n^A + n^B}
\]

and

\[
\sum_z p(Y = 1 \mid Z = z)\, p(Z = z) \neq \hat{\pi}. \tag{10}
\]

When the domain proportion $P(Z = z) = m_z$ is available,
the marginal estimator corresponds to the poststratified
estimator (4):

\[
\sum_z p(Y = 1 \mid Z = z)\, m_z
= \sum_z m_z \left\{ \sum_{c=0}^{G+1} c\,\frac{n_{cz}^B}{n_{\cdot z}^B} - \sum_{c=0}^{G} c\,\frac{n_{cz}^A}{n_{\cdot z}^A} \right\}
= \hat{\pi}_{PS}.
\]

These results indicate that we should use the poststratified
estimator $\hat{\pi}_{PS}$ with the domain estimators if we use the
stratified method.


2.2 Cross-based Method 


In the stratified method, a total sample is divided into
strata for the purpose of direct estimation of
$P(Y = 1 \mid Z = z)$, which causes sample size reduction.
Hence, in the cross-based method proposed in this section,
the joint proportion $P(Y = 1, Z = z)$ is estimated first in
order to use the entire sample, and the conditional
proportion is subsequently obtained by

\[
p(Y = 1 \mid Z = z) = \frac{p(Y = 1, Z = z)}{P(Z = z)}
\quad \text{or} \quad
p(Y = 1 \mid Z = z) = \frac{p(Y = 1, Z = z)}{p(Z = z)}.
\]

The term ‘cross-based method’ is used because this method
uses cross tabulations $P(Z = z \mid C^B = c)$, as shown in (19).

For the cross-based method, we assume that the 
following equations hold for each value of c. 


Assumption 2.

\begin{align}
P(C^B = c + 1, Y = 1) &= P(C^A = c, Y = 1), \tag{11} \\
P(C^B = 0, Y = 1) &= P(C^A = -1, Y = 1) = 0, \tag{12} \\
P(C^B = c, Y = 0) &= P(C^A = c, Y = 0). \tag{13}
\end{align}

These assumptions also imply that the difference in the
distribution between $C^A$ and $C^B$ depends only on $Y$.


We have the following result based on these
assumptions.

Cross-based Method.

\[
P(Y = 1, Z = z) = \sum_{c=1}^{G+1} P(Z = z \mid C^B = c)\, Q_{c-1}, \tag{14}
\]

where

\[
Q_c = \sum_{d=0}^{c} \{ P(C^A = d) - P(C^B = d) \}.
\]

In addition, we assume that $P(Z = z \mid C^B = c, Y = 1) =
P(Z = z \mid C^B = c)$ for every $c > 0$. This assumption
would be valid to some degree when both the key and non-key
items describe the same type of stigmatizing behavior.

Derivation.
Based on the assumptions, we have

\begin{align}
P(C^B = c) &= P(C^B = c, Y = 1) + P(C^B = c, Y = 0) \notag \\
&= P(C^A = c - 1, Y = 1) + P(C^A = c, Y = 0). \tag{15}
\end{align}

The following equation holds for any $c$.

\[
P(C^A = c, Y = 0) = P(C^A = c) - P(C^A = c, Y = 1). \tag{16}
\]

Hence, substituting (16) in (15) gives

\[
P(C^B = c) = P(C^A = c - 1, Y = 1) + P(C^A = c) - P(C^A = c, Y = 1). \tag{17}
\]

Summing (17) over $c$ up to some $g$, we obtain

\begin{align*}
\sum_{c=0}^{g} P(C^B = c) &= \sum_{c=0}^{g} P(C^A = c - 1, Y = 1) + \sum_{c=0}^{g} \{ P(C^A = c) - P(C^A = c, Y = 1) \} \\
&= \sum_{c=0}^{g} P(C^A = c) - P(C^A = g, Y = 1).
\end{align*}

By transposing the terms, we define $Q_c$.

\begin{align}
Q_c &= \sum_{d=0}^{c} \{ P(C^A = d) - P(C^B = d) \} \notag \\
&= P(C^A = c, Y = 1) \notag \\
&= P(C^B = c + 1, Y = 1). \tag{18}
\end{align}

Here, the joint proportion $P(Y = 1, Z = z)$ is
decomposed as

\[
P(Y = 1, Z = z) = \sum_{c=0}^{G+1} P(Z = z \mid C^B = c)\, P(C^B = c, Y = 1). \tag{19}
\]

Substituting the equation (18) and the assumption (12) in
(19) yields the cross-based method.

The joint estimator $p(Y = 1, Z = z)$ is obtained by
substituting each term of (14) with its estimator. When the
sample is self-weighting, the estimator is given by

\[
p(Y = 1, Z = z) = \sum_{c=1}^{G+1} \frac{n_{cz}^B}{n_c^B} \sum_{d=0}^{c-1} \left( \frac{n_d^A}{n^A} - \frac{n_d^B}{n^B} \right), \tag{20}
\]

where $n_c^A = \sum_z n_{cz}^A$ and $n_c^B = \sum_z n_{cz}^B$.

The conditional estimator $p(Y = 1 \mid Z = z)$ is obtained by
dividing $p(Y = 1, Z = z)$ by the domain proportions
$P(Z = z)$ or their estimators $p(Z = z)$.

As noted above, the main feature of the cross-based
method is that $p(Y = 1, Z = z)$ is first estimated using the
entire sample. Hence, the variance of $p(Y = 1 \mid Z = z)$ for
the cross-based method is expected to be smaller than that
of $p(Y = 1 \mid Z = z)$ for the stratified method. Moreover,
negative values will seldom be obtained in the case of the
cross-based method, while negative values will often be
obtained in the case of the stratified method. Furthermore,
the marginal estimator $p(Y = 1)$ obtained by summing (20)
is equal to the estimator (3), unless $n_c^B = 0$ for some $c$:

\begin{align}
\sum_z p(Y = 1, Z = z)
&= \sum_z \sum_{c=1}^{G+1} \frac{n_{cz}^B}{n_c^B} \sum_{d=0}^{c-1} \left( \frac{n_d^A}{n^A} - \frac{n_d^B}{n^B} \right) \notag \\
&= \sum_{c=1}^{G+1} \sum_{d=0}^{c-1} \left( \frac{n_d^A}{n^A} - \frac{n_d^B}{n^B} \right) \notag \\
&= \sum_{c=1}^{G+1} \left( \sum_{d=c}^{G+1} \frac{n_d^B}{n^B} - \sum_{d=c}^{G} \frac{n_d^A}{n^A} \right) \notag \\
&= \sum_{c=0}^{G+1} c\,\frac{n_c^B}{n^B} - \sum_{c=0}^{G} c\,\frac{n_c^A}{n^A} = \hat{\pi}. \tag{21}
\end{align}
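
The cross-based estimator (20) and the identity (21) can be checked numerically. The Python sketch below is one possible implementation under the self-weighting assumption, with responses stored as `(count, z)` pairs; the function name and the toy data are hypothetical. The running variable `q` accumulates the estimate of $Q_{c-1}$.

```python
from collections import Counter


def cross_based_joint(data_a, data_b, G):
    """Estimate p(Y = 1, Z = z) for every domain z following (20).

    data_a, data_b: lists of (count, z) pairs for subgroups A and B.
    """
    n_a, n_b = len(data_a), len(data_b)
    na_c = Counter(c for c, _ in data_a)      # respondents in A with C^A = c
    nb_c = Counter(c for c, _ in data_b)      # respondents in B with C^B = c
    nb_cz = Counter(data_b)                   # respondents in B, keyed by (c, z)
    domains = sorted({z for _, z in data_a} | {z for _, z in data_b})
    joint = {z: 0.0 for z in domains}
    q = 0.0                                   # running estimate of Q_{c-1}
    for c in range(1, G + 2):
        q += na_c[c - 1] / n_a - nb_c[c - 1] / n_b
        if nb_c[c] == 0:
            continue                          # P(Z = z | C^B = c) is not estimable
        for z in domains:
            joint[z] += nb_cz[(c, z)] / nb_c[c] * q
    return joint


# Hypothetical responses with G = 2 non-key items
data_a = [(0, "m"), (1, "m"), (1, "f"), (2, "f"), (0, "f"), (1, "m")]
data_b = [(1, "m"), (2, "m"), (3, "f"), (1, "f"), (0, "f"), (2, "m")]
joint = cross_based_joint(data_a, data_b, G=2)

# Overall item count estimate (3), for comparison with the sum over domains
pi_hat = (sum(c for c, _ in data_b) / len(data_b)
          - sum(c for c, _ in data_a) / len(data_a))
print(joint, sum(joint.values()), pi_hat)
```

In this example every value $c = 1, \ldots, G + 1$ is observed in subgroup B, so the domain estimates sum to the overall estimate, as (21) states. When some count is not observed in B, the sketch skips that term and the identity may fail.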


Of course, when the domain proportions $P(Z = z) = m_z$ are
known, we can use them to obtain a poststratified estimator
$p(C^A = d)$ or $p(C^B = d)$ in (14), for example

\[
p(C^A = d) = \sum_z m_z\, \frac{n_{dz}^A}{n_{\cdot z}^A}.
\]

In this case, $\sum_z p(Y = 1, Z = z)$ coincides with the
poststratified estimator $\hat{\pi}_{PS}$.

One drawback of the cross-based method is that the
variance of $p(Y = 1 \mid Z = z)$ is almost impossible to estimate
algebraically. Hence, resampling methods such as the
jackknife or bootstrap would be necessary. Additionally,
since it is impossible to determine which of the stratified
method and the cross-based method is more efficient,
simulation studies are conducted in a later section.


2.3. Double Cross-based Method 


Before proceeding to the simulation study, we suggest a
modified version of the cross-based method. In equation
(19) of the cross-based method, we use $P(Z = z \mid C^B = c)$.
In the same way, when $P(Z = z \mid C^A = c)$ is used, we obtain

\begin{align}
P(Y = 1, Z = z) &= \sum_{c=0}^{G} P(Z = z \mid C^A = c)\, P(C^A = c, Y = 1) \notag \\
&= \sum_{c=0}^{G} P(Z = z \mid C^A = c)\, Q_c. \tag{22}
\end{align}

Hence, a double cross-based method is obtained by
combining (14) and (22) as follows:

\[
P(Y = 1, Z = z) = \sum_{c=0}^{G} \left\{ w^A P(Z = z \mid C^A = c) + w^B P(Z = z \mid C^B = c + 1) \right\} Q_c, \tag{23}
\]

where $w^A$ and $w^B$ are the non-negative weights for each
subgroup, the sum of which is equal to one.

The following equation also holds for the double cross-based
method for any $w^A$ and $w^B$, unless $n_c^A = 0$ or
$n_c^B = 0$ for some $c$.

\[
\sum_z p(Y = 1, Z = z) = \hat{\pi}. \tag{24}
\]
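
The double cross-based estimator can be sketched in the same way as the cross-based one. The Python code below is a minimal illustration of (23) under the self-weighting assumption; the function name and the toy data are hypothetical, and the default weight $w^A = n^A/(n^A + n^B)$ is one possible choice (it is the one used in the simulations of section 3.2.1).

```python
from collections import Counter


def double_cross_based_joint(data_a, data_b, G, w_a=None):
    """Estimate p(Y = 1, Z = z) for every domain z following (23).

    data_a, data_b: lists of (count, z) pairs; w_a is the weight of subgroup A.
    """
    n_a, n_b = len(data_a), len(data_b)
    if w_a is None:
        w_a = n_a / (n_a + n_b)               # assumed default weight
    w_b = 1.0 - w_a
    na_c = Counter(c for c, _ in data_a)
    nb_c = Counter(c for c, _ in data_b)
    na_cz, nb_cz = Counter(data_a), Counter(data_b)
    domains = sorted({z for _, z in data_a + data_b})
    joint = {z: 0.0 for z in domains}
    q = 0.0
    for c in range(G + 1):
        q += na_c[c] / n_a - nb_c[c] / n_b    # estimate of Q_c
        for z in domains:
            # Estimated P(Z = z | C^A = c) and P(Z = z | C^B = c + 1)
            share_a = na_cz[(c, z)] / na_c[c] if na_c[c] else 0.0
            share_b = nb_cz[(c + 1, z)] / nb_c[c + 1] if nb_c[c + 1] else 0.0
            joint[z] += (w_a * share_a + w_b * share_b) * q
    return joint


# Hypothetical responses with G = 2 non-key items
data_a = [(0, "m"), (1, "m"), (1, "f"), (2, "f"), (0, "f"), (1, "m")]
data_b = [(1, "m"), (2, "m"), (3, "f"), (1, "f"), (0, "f"), (2, "m")]
joint_d = double_cross_based_joint(data_a, data_b, G=2)
print(joint_d, sum(joint_d.values()))
```

With these data the overall item count estimate is $9/6 - 5/6 = 2/3$, and the two domain estimates add up to the same value, in line with (24).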


3. Numerical Experiments 


3.1 Data Set 


In order to compare the precision of the estimators, we 
conducted simulation experiments using data obtained from 
the survey of the Japanese national character (Sakamoto, 
Tsuchiya, Nakamura, Maeda and Fouse 2000). Although 
the respondents were selected via a stratified two-stage 
sampling from Japanese aged 20 and over, we neglect the 
sampling design because the collected sample of N = 1,339 
is treated as the “true” population in this experiment. Table 
2 lists the results of a question concerning the significant 
attributes of the Japanese character. Respondents were 
asked in a face-to-face interview to choose as many 
adjectives from among ten alternatives as they thought 
described the Japanese character. 


Table 2
Significant Attributes of Japanese character
N = 1,339
(Hand card) Which of the following adjectives do you think describes
the character of the Japanese people? Choose as many as you like.

 1 Rational      18%      6 Kind         42%
 2 Diligent      71%      7 Original      7%
 3 Free          13%      8 Polite       50%
 4 Open, frank   14%      9 Cheerful      8%
 5 Persistent    51%     10 Idealistic   23%


The form of this question is different from that of the 
item count technique. In the item count technique, the 
respondent is asked to “answer the number of adjectives.” In 
contrast, in this survey the respondent is asked to “circle as 
many adjectives you feel are appropriate.” In addition, the 
ten items are not very sensitive, hence the respondents 
should not hesitate during the selection. However, since the 
real contingency table between each of the ten items and 
another variable Z is obtained, we can evaluate the 
performance of estimators through a pseudo item count 
procedure. 

We took each of the following three items as the key
item $Y$, where $Y = 1$ implies that the item was selected.

- 7 Original ($\pi$ is the least among the ten items)
- 8 Polite ($\pi$ is just 50%)
- 2 Diligent ($\pi$ is the largest among the ten items)

Three combinations of non-key items are used, as listed
in Table 3. Combination 1 comprises two items with low
proportions, while combination 2 comprises two items with
high proportions. Combination 3 is the case with the
maximum number of non-key items.


Table 3 
Three Combinations of Non-key Items 


Non-key items 


Combination 1 (G = 2): 9 Cheerful (8%) 
3 Free (13%) 
Combination 2 (G = 2): 5 Persistent (51%) 
6 Kind (42%) 


Combination 3 (G = 9): Nine items other than the key item 


We used either gender or age as the domain variable Z. 
Gender is either male or female, and the age categories are 
“20 — 29,” “30 — 39,” “40 — 49,” “50 — 59,” “60 — 69”, and 
“70 and over.” 


3.2 Direct Questioning Versus Item Count 
Technique 


3.2.1 Simulation Methods 


First, we compare the standard errors between the direct 
questioning and the item count techniques. In this 
experiment, we attempted one combination of “7 Original” 
(key item), combination 3 (non-key items), and gender 
(domain variable). The contingency table based on the entire 
sample of N =1,339 is listed in Table 4. 


Table 4
A Contingency Table Between “7 Original” and Gender

                        7 Original
             1              0               Total
Male         46 (7.5)       569 (92.5)      615 (100.0)
Female       51 (7.0)       673 (93.0)      724 (100.0)
Total        97 (7.2)     1,242 (92.8)    1,339 (100.0)


The simulation was conducted through the following 
procedures: 


Step 1. Suppose the total sample of $N = 1{,}339$ to be a
population.

Step 2. Draw a subsample $S$ of size $Nf$, where $f$ is a
sampling fraction, by simple random sampling
without replacement.

Step 3. As the simulated result of the direct questioning
method, compute the proportions directly,
$p(Y = 1 \mid Z = \text{male})$ and $p(Y = 1 \mid Z = \text{female})$.

Step 4. Divide the subsample $S$ into two groups $S^A$ and
$S^B$ of size $n^A$ and $n^B$ that are not necessarily of
equal size. Count the number $C^A$ of selected non-key
items for each respondent in $S^A$. Also, count
the number $C^B$ of selected items including both
the key item and the non-key items in $S^B$.

Step 5. As the simulated result of the item count technique,
compute $p(Y = 1 \mid Z = \text{male})$ and $p(Y = 1 \mid Z = \text{female})$
via the three estimation methods: stratified
method, cross-based method, and double cross-based
method. In the double cross-based method, we
let $w^A = n^A / (n^A + n^B)$ and $w^B = n^B / (n^A + n^B)$.

Step 6. We let $f = 0.1$ in step 2 and perform steps 2 to 5
for 2,000 iterations. Calculate the means
$E_D$, $E_S$, $E_C$, and $E_W$ and the standard deviations
$SE_D$, $SE_S$, $SE_C$, and $SE_W$ of each estimation
method to approximate the expectations and the
standard errors of the estimators, where the
subscripts D, S, C, and W indicate the direct
questioning method, the stratified method, the
cross-based method, and the double cross-based
method, respectively. In the same way, we let
$f = 0.2$ and perform steps 2 to 5 for 2,000
iterations, and so on up to and including $f = 0.9$.
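
The structure of steps 1 to 6 can be sketched as a short Python program. This is a simplified illustration, not the simulation code used for the article: the population is generated artificially (the survey microdata are not reproduced here), only the direct and the stratified estimators are computed, the subsample is split into two halves, and the item probabilities, the seed and the number of iterations are assumptions chosen for the example.

```python
import random
from statistics import mean, stdev


def make_population(n, rng):
    """Hypothetical population of (z, y, non-key count) triples."""
    pop = []
    for _ in range(n):
        z = rng.choice(["male", "female"])
        y = int(rng.random() < 0.07)                      # key item
        c = sum(rng.random() < q for q in (0.08, 0.13))   # G = 2 non-key items
        pop.append((z, y, c))
    return pop


def one_replicate(pop, f, rng, z="male"):
    sample = rng.sample(pop, round(len(pop) * f))         # step 2: SRS without replacement
    direct = mean(y for zz, y, _ in sample if zz == z)    # step 3: direct questioning
    rng.shuffle(sample)                                   # step 4: random division
    half = len(sample) // 2
    s_a, s_b = sample[:half], sample[half:]
    c_a = [c for zz, _, c in s_a if zz == z]              # C^A: non-key items only
    c_b = [c + y for zz, y, c in s_b if zz == z]          # C^B: key item included
    stratified = mean(c_b) - mean(c_a)                    # step 5: estimator (7)
    return direct, stratified


def simulate(pop, f, iterations, rng):
    results = [one_replicate(pop, f, rng) for _ in range(iterations)]
    direct, stratified = zip(*results)
    # Step 6: approximate expectations and standard errors
    return mean(direct), stdev(direct), mean(stratified), stdev(stratified)


rng = random.Random(1)
population = make_population(1339, rng)                   # step 1
e_d, se_d, e_s, se_s = simulate(population, f=0.1, iterations=500, rng=rng)
print(e_d, se_d, e_s, se_s)
```

Repeating the call to `simulate` for $f = 0.2, \ldots, 0.9$ gives the kind of curves plotted against the sampling fraction; adding the cross-based estimators to `one_replicate` would complete the comparison.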


3.2.2 Simulation Results

Figure 1 shows the approximated expectations and
standard errors of the estimators. The horizontal axes
indicate the sampling fraction $f$. In both cases, male and
female, the approximated expectations $E_D$ are stable at
every $f$-value, while $E_S$, $E_C$, and $E_W$ of the item count
technique fluctuate irregularly. This is because randomness
is introduced twice under the item count, i.e., in the
sampling phase and in the division phase, whereas
randomness is introduced only in the sampling phase under
the direct questioning scenario. Even if $f = 1$, the estimator
under the item count technique has a certain amount of
variance due to the randomness at the division phase. As the
range of fluctuation was negligible compared to the
magnitude of the standard errors, which are referred to
below, we concluded that the number of repetitions was
sufficient.


[Figure 1 panels: Expectations (Male); Standard Errors (Male). Horizontal axes: sampling fraction f.]
Figure 1. Approximated Expectations and Standard Errors of Estimators.

The standard errors $SE_D$ of the direct questioning
method are considerably smaller than those of the item
count. In the case of the item count, standard errors do not
converge to zero even if $f = 1$. As noted above, this is
because randomness is also introduced in the division
phase. The standard errors of the stratified method are
clearly larger than those of the two cross-based methods.
The lines indicating the results for the cross-based method
and the double cross-based method almost overlap, and
show no notable differences.

In order to evaluate the amount of variances or standard
errors of estimators, let us consider the following indices
that are analogous to the design effect (Kish 1965),

\[
\mathrm{Def}_{M_1, M_2} = \frac{SE_{M_1}^2}{SE_{M_2}^2},
\]

where $M_1$ and $M_2$ indicate one of the four methods D, S,
C, and W. Although we have omitted the detailed results,
roughly summarized, $\mathrm{Def}_{C,D}$ ranges from 50 (when
$f = 0.1$) to 700 (when $f = 0.9$). That is, even if we use the
cross-based method, the standard errors of the item count
inflate nearly seven- to twenty-six-fold as compared to the
direct questioning. However, the variance reduction attained
by using the double cross-based method instead of the
stratified method ranges from $\mathrm{Def}_{W,S} = 0.39$ (male) to 0.55
(female). In other words, the standard errors of the double
cross-based method are reduced to about 62 percent of those
of the stratified estimate at the minimum, and 74 percent at the
maximum.
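
The conversion between the index and the ratio of standard errors is a square root, on the reading that the index is a ratio of squared standard errors (a reading that is consistent with the figures quoted above). The short Python check below reproduces that arithmetic; the function name is hypothetical.

```python
import math


def se_ratio(def_value):
    """Ratio of standard errors implied by a ratio of variances."""
    return math.sqrt(def_value)


cases = [
    ("Def_{C,D}, f = 0.1", 50),
    ("Def_{C,D}, f = 0.9", 700),
    ("Def_{W,S}, male", 0.39),
    ("Def_{W,S}, female", 0.55),
]
for label, value in cases:
    print(f"{label}: SE ratio = {se_ratio(value):.2f}")
```

The resulting ratios are approximately 7.07, 26.46, 0.62 and 0.74, matching the seven- to twenty-six-fold inflation and the 62 to 74 percent figures in the text.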


3.3 Stratified Versus Cross-based Method 
3.3.1 Simulation Methods 


In the previous section, the precision of the cross-based
and the double cross-based methods appeared to be higher
than that of the stratified method. We shall check the
precision of these methods for other combinations of the key
item, the non-key items, and the domain
variable $Z$ by simulation experiments.

In this section, we used all samples as follows: 


Step 1. Compute $P(Y = 1 \mid Z = z)$ for each $z$ based on all
data of size $N = 1{,}339$.

Step 2. Divide the total sample ($N = 1{,}339$) randomly into
group A and group B of size $n^A$ and $n^B$, where
$N = n^A + n^B$.

Step 3. Count the number $C^A$ of selected non-key items
for each respondent of group A, and count the
number $C^B$ of selected items, including both the
key item and non-key items, in group B.

Step 4. Estimate $p(Y = 1 \mid Z = z)$ by the stratified method,
the cross-based method, and the double cross-based
method, respectively.

Step 5. Compute the chi-squared distance $e^2$ between
$P(Y = 1 \mid Z = z)$ and $p(Y = 1 \mid Z = z)$ for each
method.

\[
e^2 = \sum_z \frac{\big( p(Y = 1 \mid Z = z) - P(Y = 1 \mid Z = z) \big)^2}{P(Y = 1 \mid Z = z)}
\]

Step 6. Repeat the above procedure from step 2 through
step 5 for 1,000 iterations. Calculate the means and
the standard deviations of $e^2$ for each method.
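
The distance computed in step 5 is straightforward to express in code. The Python sketch below assumes the usual chi-squared form, summing the squared differences over the domains and dividing each by the true proportion; the function name is hypothetical, the true proportions are taken from Table 4, and the estimates are invented for illustration.

```python
def chi_squared_distance(estimates, truths):
    """Sum over domains z of (p_z - P_z)^2 / P_z."""
    return sum((estimates[z] - truths[z]) ** 2 / truths[z] for z in truths)


truths = {"male": 46 / 615, "female": 51 / 724}     # proportions from Table 4
estimates = {"male": 0.10, "female": 0.05}          # hypothetical domain estimates
e2 = chi_squared_distance(estimates, truths)
print(e2)
```

Dividing by the true proportion makes a given absolute error count more heavily when the key item is rare, which is worth keeping in mind when comparing distances across key items.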


In addition, we simulated the stratified method under the
randomized response for reference via the following
procedure:

Step 1. Let $p$ be a proportion as described below. Divide
the total sample ($N = 1{,}339$) randomly into two
groups. Group A is composed of $Np$ respondents,
and group B is composed of $N(1 - p)$ respondents.

Step 2. Let $n_z^A$ be the number of respondents who selected
the key item and $Z = z$ in group A. Let $n_z^B$ be the
number of respondents who did not select the key
item and $Z = z$ in group B. Let $n_z$ be the number
of respondents with $Z = z$. Compute

\[
p(Y = 1 \mid Z = z) = \frac{(n_z^A + n_z^B)/n_z + p - 1}{2p - 1}.
\]

Step 3. Calculate $e^2$ employing the same equation as used
in the item count technique.

Step 4. Repeat the above procedure from step 1 through
step 3 for 1,000 iterations. Calculate the means and
the standard deviations of $e^2$ for each method.

We used three $p$ values: $p = 0.2$, $p = 0.3$, and $p = 0.4$.


3.3.2 Simulation Results 


Table 5 and Table 6 list the means and the standard
deviations of the 1,000 $e^2$ values for the domain variable $Z$ of gender
and age, respectively. A smaller mean of the “$e^2$-value”
indicates that the domain estimators are more precise. In some
repetitions, illogical estimates $p(Y = 1 \mid Z = z)$, which
fall outside the range [0, 1], were obtained. The columns
of the tables denoted by “under” indicate the number of
repetitions in which at least one of the estimates
$p(Y = 1 \mid Z = z)$ was under 0, and “over” indicates the number
in which at least one of the estimates was over 1. Ideally, the
figures in the columns of “illogical p” should be 0.

Table 5
Means and Standard Deviations of e²s and Number of Times Illogical Estimates were Obtained (Domain Variable Z is Gender)

                            7 Original (7%)                8 Polite (50%)                 2 Diligent (71%)
                            e²-value        illogical p    e²-value        illogical p    e²-value        illogical p
                            mean  (s.d.)    under  over    mean  (s.d.)    under  over    mean  (s.d.)    under  over
Stratified method
  Combination 1             38    (36)      39     0       6     (6)       0      0       4     (4)       0      0
  Combination 2             89    (92)      179    0       16    (17)      0      0       10    (11)      0      0
  Combination 3             341   (330)     457    0       44    (43)      0      0       33    (32)      0      a
Cross-based method
  Combination 1             18    (24)      1      0       4     (5)       0      0       3     (3)       0      0
  Combination 2             45    (65)      41     0       10    (12)      0      0       a)    (8)       0      0
  Combination 3             163   (239)     186    0       22    (31)      0      0       17    (23)      0      1
Double cross-based method
  Combination 1             18    (24)      1      0       3     (4)       0      0       a     (3)       0      0
  Combination 2             45    (65)      31     0             (12)      0      0       6     (8)       0      0
  Combination 3             163   (240)     jg     0       21    (31)      0      0       16    (23)      0      0
Randomized response
  p = 0.2                   2     (14)      0      0       3     (3)       0      0       2     (2)       0      0
  p = 0.3                   35    (43)      41     0       8     (7)       0      0       >     (5)       0      0
  p = 0.4                   158   (181)     305    0       Sb    (34)      0      0       23    (23)      0      3
Note: e” —value is multiplied by 10°. 

Table 6
Means and Standard Deviations of e² and Number of Times Illogical Estimates Were Obtained (Domain Variable Z is Age)

                            7 Original (7%)               8 Polite (50%)                2 Diligent (71%)
                            e²-value        illogical p   e²-value        illogical p   e²-value        illogical p
                            mean (s.d.)     under  over   mean (s.d.)     under  over   mean (s.d.)     under  over
Stratified method
  Combination 1             375 (226)       609    0      60 (39)         0      0      39 (26)         0      0
  Combination 2             859 (507)       799    0      152 (91)        0      0      97 (58)         0      18
  Combination 3             3,410 (2,108)   926    1      446 (290)       48     41     333 (217)       9      353
Cross-based method
  Combination 1             93 (82)         8      0      52 (20)         0      0      28 (16)         0      0
  Combination 2             175 (195)       138    0      80 (42)         0      0      59 (33)         0      0
  Combination 3             536 (733)       273    0      89 (95)         0      0      70 (71)         0      10
Double cross-based method
  Combination 1             70 (75)         8      0      13 (13)         0      0      9 (8)           0      0
  Combination 2             153 (202)       93     0      45 (35)         0      0      31 (23)         0      0
  Combination 3             526 (745)       246    0      2 (94)          0      0      ay (70)         0      1
Randomized response
  p = 0.2                   158 (101)       284    0      2D (14)         0      0      17 (11)         0      0
  p = 0.3                   476 (294)       720    0      74 (42)         0      0      51 (31)         0      2
  p = 0.4                   2,181 (1,348)   945    0      B35. (193)      9      9      232 (136)       0      217
Note: e²-value is multiplied by 10°.

For every combination of the key item, the non-key items, and the domain variable Z, the means of $e^2$ of the double cross-based method are the smallest, and those of the cross-based method are the second smallest by a narrow margin. When $\pi$ of the key item is low (“7 Original”), the number of non-key items is large (combination 3), and the number of alternatives of the domain variable Z is large (age), the accuracy of the stratified method decreases greatly compared to other combinations.

Moreover, when $\pi$ of the key item is low, negative estimates are often observed when the stratified method is used. For example, when combining “7 Original,” combination 3 and age, the frequency of observed negative estimates is 926 out of 1,000 iterations. When the double cross-based method is used, negative estimates are less likely to be observed.

For randomized response, when the number of alternatives of the domain variable Z is small (gender), the accuracy of the estimates seems to be the same as that of the cross-based and the double cross-based methods. However, the mean $e^2$ is somewhat larger than that of the cross-based method when the domain variable Z has many options (age). The randomized response, for which only the stratified method is available, also suffers from negative estimates, particularly when $\pi$ is small (“7 Original”).


4. Conclusion 


The following results were obtained through simulation 
experiments: 


- The cross-based method or the double cross-based method, which is proposed in this article, should be used to estimate domain parameters when the data are obtained via the item count technique. In the first simulation, the variances of the cross-based estimators were between 39 percent (at the minimum) and 55 percent (at the maximum) of the variance of the stratified estimator. In the simulation studies, the double cross-based method made no drastic improvement in precision compared with the cross-based method.


- Even when the double cross-based method is used, the standard errors of the domain estimators are much larger than those of the direct questioning technique.


The true $\pi = \bar{Y} = P(Y = 1)$ of a question to which respondents avoid giving a truthful answer is often small. In addition, an indirect questioning technique is used in order to ensure protection of privacy, and respondents feel that their privacy is secured when many non-key items are included (Hubbard et al. 1989). The simulation studies show that in such situations, the cross-based method or the double cross-based method is more efficient than the traditional stratified method.

The domain estimators obtained by the traditional stratified method are generally inconsistent with the estimator $\hat{\pi}$, as shown in (10). A poststratified estimator based on the domain variable addressed is essential in order to ensure consistency. Alternatively, we have to divide the total sample into two subgroups in advance so that the distributions of their domain variable match. In contrast, the domain estimators obtained by the cross-based and the double cross-based methods are consistent with $\hat{\pi}$, as shown in (21). However, this does not mean that the cross-based method automatically adjusts the two subgroups so that the sample distributions of the domain variable match between the two subgroups. For the cross-based method, poststratification by the domain variables or other demographic variables is also admissible, but not indispensable.

Even when the double cross-based method is used, negative domain estimates are sometimes observed. It is possible to avoid negative estimates by setting a negative estimate in (23) to zero. However, such an adjustment produces a positive bias in $\hat{p}(Y = 1 \mid Z = z)$.

The data from the survey of the Japanese national character, which were used in the simulation experiments, are neither sensitive nor obtained via the item count technique. In the future, the performance of the proposed method should be assessed by applying it to data obtained via the item count technique.


Editing Systematic Unity Measure Errors Through Mixture Modelling


Marco Di Zio, Ugo Guarnera and Orietta Luzi ¹


Abstract 


In Official Statistics, the data editing process plays an important role in terms of timeliness, data accuracy, and survey costs. Techniques introduced to identify and eliminate errors from data must consider all of these aspects simultaneously. A frequent and pervasive systematic error in surveys collecting numerical data is the unity measure error, which strongly affects the timeliness, data accuracy and costs of the editing and imputation phase. In this paper we propose a probabilistic formalisation of the problem based on finite mixture models. This setting allows us to deal with the problem in a multivariate context, and also provides a number of useful diagnostics for prioritising cases to be investigated more deeply through clerical review. Prioritising units is important in order to increase data accuracy while avoiding time wasted on following up units that are not really critical.


Key Words: Editing; Random error; Systematic error; Selective editing; Model-based cluster analysis. 


1. Introduction 


Elements determining the quality of an Editing and Imputation (E&I) process are various and have been widely discussed in the literature (Granquist 1995). We deal with a particular non-sampling error that strongly affects two main competing quality dimensions: timeliness and data accuracy. As far as accuracy is concerned, we adopt the definition suggested in the Encyclopedia of Statistical Sciences (1999): “accuracy concerns the agreement between statistics and target characteristics”. A number of factors can cause inaccuracy throughout the statistical survey process. Inaccuracy can be reduced during the E&I phase, which can be viewed as an “accuracy improvement tool by which erroneous or highly suspect data are found, and if necessary corrected (imputed)” (Federal Committee on Statistical Methodology 1990).

Due to the complexity of the investigated phenomena and the existence of several types of non-sampling errors, the E&I phase can be a very complex and time-consuming task (Granquist 1996). In the specialised literature a common classification distinguishes two error typologies: systematic error and random error. The former relates to errors which go in the same direction and lead to a bias in statistics, while the latter refers to errors which spread randomly around zero and affect the variance of estimates (Encyclopedia of Statistical Sciences 1999). Understanding the nature of errors is useful not only to identify their source and to assess their effects on estimates, but also to adopt the most appropriate methodology to deal with them (Di Zio and Luzi 2002). While the Fellegi-Holt approach (Fellegi and Holt 1976) is a well-established paradigm to deal with random errors, systematic errors are generally treated by means of ad hoc solutions (see for instance Euredit 2003, Vol. 1, Chapter 5). Systematic errors are generally treated before dealing with random errors, particularly when the latter are tackled through automatic software, such as the Generalised Editing and Imputation System (GEIS) (Kovar, MacMillan and Whitridge 1988) and, more recently, De Waal (2003).

In the family of systematic errors, one that has a high impact on final estimates and that frequently affects data in statistical surveys measuring quantitative characteristics (e.g., business surveys) is the unity measure error, in which the reported value is the true value multiplied by a constant factor (e.g., 100 or 1,000). This error is due to the erroneous choice, by some respondents, of the unity measure in reporting the amount of some questionnaire items.

As real examples of surveys affected by this type of 
error, we selected two ISTAT investigations: the 1997 
Italian Labour Cost Survey (LCS) and the 1999 Italian 
Water Survey System (WSS). 

The LCS is a periodic sample survey that collects information on employment, worked hours, wages and salaries and labour cost on about 12,000 enterprises with more than 10 employees. In Figure 1 the logarithms of Labour Cost (LCOST), Number of Employees (LEMPLOY) and Worked Hours (LWORKEDH) are represented in a scatter plot matrix. Note that the employment variable at this editing stage is error free because of a preliminary check with respect to information from business registers (Cirianni, Di Zio, Luzi and Seeber 2000). The analysis of Figure 1 shows that Labour Cost is affected by two types of unity measure error (i.e., the 1 million and the 1,000 factor), while Worked Hours exhibits only the 1,000 factor error. These errors cause the different clusters in Figure 1. Note that the clusters in the lower left corner of each scatter plot represent non-erroneous data.

Figure 1. Multiple scatter plot between total labour cost, employees, worked hours (logarithmic scale).


The WSS example will be described in detail in 
subsection 4.2 where an application of the method proposed 
in this paper for identifying and treating the unity measure 
error will be presented. 

For the unity measure error, the critical point is the 
localisation of items in error rather than their treatment. In 
fact, once an item is classified as erroneous, the optimal 
treatment is uniquely determined and consists in a 
deterministic action recovering the original value through an 
inverse action (e.g., division by 1,000) neutralising the error 
effect. 

The unity measure error is generally tackled through ad hoc procedures that rely essentially on graphical representations of marginal or bivariate distributions, and on ratio edits. A ratio edit is a rule stating that the value of a ratio between two variables must lie within a predefined interval. The interval bounds are generally determined through a priori knowledge or via exploratory data analysis, possibly using reliable auxiliary information. For this type of error, ratio edits are effective when one of the two variables is error free. Furthermore, ratio edits take into account only bivariate relationships between variables, and even with interactive graphical inspection (e.g., a scatter plot matrix) no more than a pairwise analysis can be performed, disregarding more complex interactions between variables. Finally, we notice that adopting pairwise analyses implies that variables have to be treated in a pre-defined hierarchy, thus increasing the complexity of the error localisation procedure.
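
A ratio edit of the kind described above can be sketched as follows. This is only an illustration: the records, the variable names and the interval bounds [10, 100] are hypothetical and are not taken from the LCS.

```python
def ratio_edit(numerator, denominator, lower, upper):
    """Return True when numerator / denominator lies inside [lower, upper]."""
    if denominator == 0:
        return False  # the ratio is undefined, so the record fails the edit
    return lower <= numerator / denominator <= upper


# Hypothetical records: labour cost per employee is expected to lie in [10, 100].
records = [
    {"id": 1, "labour_cost": 2_400.0, "employees": 60},
    {"id": 2, "labour_cost": 2_400_000.0, "employees": 60},  # cost off by a factor 1,000
]
flagged = [
    r["id"]
    for r in records
    if not ratio_edit(r["labour_cost"], r["employees"], 10.0, 100.0)
]
```

The edit works here because the denominator (number of employees) is assumed error free; if both variables carried the same factor error, the ratio would be unchanged and the record would pass.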

With traditional approaches, the error localisation problem is not only complex, but also time-consuming and costly. Time and cost are mainly affected by: 1) the complexity of designing and implementing automatic deterministic ad hoc procedures, and 2) the resources spent in manually editing observations having low probabilities of being in error and/or low impact on target estimates (over-editing).

In this paper we propose a probabilistic formalisation of 
the problem through finite mixture models (McLachlan and 
Basford 1988; McLachlan and Peel 2000). 

This modelling provides a principled statistical approach, allowing an estimate of the conditional probability that an observation is affected by unity measure error. The advantage of the proposed approach is that it is a general method that allows a multivariate data analysis and provides elements that can be used to optimise the balance between the automatic and interactive components of the editing procedure, i.e., between time and accuracy (Granquist and Kovar 1997).

This work is organised as follows. In section 2 the 
proposed model is introduced together with the EM 
algorithm for the estimates of the model parameters. In 
section 3 diagnostics for selective editing are described. In 
section 4 the results of the application of the proposed 
method to both simulated and real data are illustrated. 
Finally, in section 5 concluding remarks and future research 
are outlined. 


2. The Model 


It is hard to give a comprehensive formalisation of random and systematic errors. In this context, we provide a definition that, though not exhaustive, includes many common situations. Let $X^*$ be the vector of the survey target variables, and $(\mu, \Sigma^*)$ the corresponding mean vector and covariance matrix. Let us suppose that the measurement process is affected by a random error mechanism R that has an impact on the covariance structure of $X^*$ but leaves the mean vector unchanged, and consequently let $X$ be the corresponding “contaminated” variable, with $E(X) = E(X^*) = \mu$ and $\mathrm{Var}(X) = \Sigma$. Also, we assume that $X$ can in turn be affected by a systematic error mechanism S acting only on its expected value: $\mu \to \phi(\mu)$ for some function $\phi$ (e.g., if an additive error mechanism is assumed, $\phi(\mu) = \mu + \text{constant}$). As a consequence of the two error mechanisms, assumed to be independent of one another, observed data can be described by a random vector $Y$ whose distribution, conditional on $X$, depends only on the systematic error mechanism. Our approach to the treatment of systematic errors consists of building up a model for $Y$ focusing only on the detection of systematic errors, thus aiming at recovering the randomly contaminated data represented by the random vector $X$. This is the approach generally adopted in editing procedures, where systematic errors and random errors are dealt with separately and hierarchically.

The previous definition of systematic error includes unity measure error, once data have been transformed to logarithmic scale. In fact, unity measure error generally acts by multiplying variables by a constant factor. Hence data in error appear in log-scale as translated by a vector of constants that depends on which items are in error (“error pattern”), while the covariance structure is the same for each error pattern. Moreover, as a matter of fact, in business surveys variables are frequently considered log-normal. Thus in logarithmic scale the Gaussian setting can be adopted.

Following the formalisation introduced so far, our goal becomes to assign each single observation to a specific “error pattern”, which amounts to localising the items in error. If we interpret each single error pattern as a “cluster”, the error localisation problem is transformed into a cluster analysis problem, and we can exploit experience from model-based cluster analysis theory (Fraley and Raftery 2002).

In more detail, let us suppose we have n independent observations $Y_i = (Y_{i1}, \ldots, Y_{iq})$, $i = 1, \ldots, n$, corresponding to the q-dimensional vectors $X_i = (X_{i1}, \ldots, X_{iq})$, $i = 1, \ldots, n$, such that $E(X_{i1}, \ldots, X_{iq}) = (\mu_1, \ldots, \mu_q) = \mu$ and $\mathrm{Var}(X_{i1}, \ldots, X_{iq}) = \Sigma$.

Based on the assumption that systematic errors affect the random vector $X$ only by transforming its expected value $\mu$ into $\phi_g(\mu)$, where the $\phi_g(\cdot): \mathbb{R}^q \to \mathbb{R}^q$, for $g = 1, \ldots, h$, are a set of known functions, the functions $\phi_g$ uniquely characterise h distinct clusters (error patterns), differing from each other only in the location parameter. For instance, if the systematic error possibly affects all the variables $X_s$, for $s = 1, \ldots, q$, in the same manner by transforming their expected values $\mu_s$ according to $\mu_s \to \mu_s + C$, where C is a known constant, the number of clusters will be $h = 2^q$, i.e., the number of different combinations of error occurrence on the q variables (including the case of no error). In this case, each function $\phi_g$, and each corresponding cluster, is associated with one of the $2^q$ possible subsets of variables affected by the error; e.g., the group $G_g$ characterised by the mean vector $\mu_g = (\mu_1, \mu_2 + C, \mu_3, \ldots, \mu_q)$ is a cluster of units with error affecting only the variable $X_2$. We remark that we assume a common covariance matrix because we make the hypothesis that the possible random error acts in the same way on all the data.
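
The correspondence between error patterns and translation vectors can be made concrete with a short sketch. It assumes, as in the example above, that every variable is affected by the same known constant C (here log 1,000, i.e., a factor 1,000 on the original scale); the function name is hypothetical.

```python
import math
from itertools import product


def error_patterns(q, shift):
    """List the 2**q translation vectors: each variable is either unshifted or shifted."""
    return [tuple(shift * flag for flag in flags) for flags in product((0, 1), repeat=q)]


# Three variables, factor-1,000 error on the log scale: 2**3 = 8 clusters.
patterns = error_patterns(3, math.log(1000))
only_second_variable = (0, math.log(1000), 0)
```

The first pattern is the zero vector (no error), and `only_second_variable` is the translation associated with the cluster in which only $X_2$ is in error.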

For the error localisation purpose we follow a model-based approach based on finite mixture models, where each mixture component $G_g$, $g = 1, \ldots, h$, represents a single error pattern. Formally, we assume that the $Y_i = (Y_{i1}, \ldots, Y_{iq})$, for $i = 1, \ldots, n$, are i.i.d. according to the mixture density $\sum_{t=1}^{h} \pi_t f_t(y; \theta_t)$, where $\sum_t \pi_t = 1$ and $\pi_t \geq 0$. The mixing parameter $\pi_t$ represents the probability that an observation belongs to the t-th mixture component.

In order to classify an observation $y_i$ into one of the h groups, we compute the posterior probability $\tau_g(y_i; \theta, \pi) = \mathrm{pr}(i\text{-th observation} \in G_g \mid y_i; \theta, \pi)$, that is

$$\tau_g(y_i; \theta, \pi) = \frac{\pi_g f_g(y_i; \theta_g)}{\sum_{t=1}^{h} \pi_t f_t(y_i; \theta_t)}. \qquad (1)$$

The i-th observation is assigned to the cluster $G_t$ if

$$\tau_t(y_i; \theta, \pi) > \tau_g(y_i; \theta, \pi), \quad g = 1, \ldots, h; \; g \neq t.$$

The previous allocation rule is the optimal solution for the classification problem, in the sense that it minimises the overall error rate (Anderson 1984, Chapter 6).
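
The allocation rule can be illustrated with a small sketch. It assumes the setting the paper adopts for unity measure error: bivariate Gaussian components that share one covariance matrix and whose means are a common vector plus known translation vectors. Because the covariance is common, the Gaussian normalising constant cancels in (1) and only the squared Mahalanobis distances matter. The function names and the numerical values are hypothetical.

```python
import math


def mahalanobis_sq(y, m, cov):
    """Squared Mahalanobis distance in two dimensions (2x2 covariance inverted by hand)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    u, v = y[0] - m[0], y[1] - m[1]
    return (d * u * u - (b + c) * u * v + a * v * v) / det


def posteriors(y, mu, cov, shifts, weights):
    """Posterior probabilities tau_g, proportional to pi_g * exp(-D_g / 2)."""
    logs = [
        math.log(w) - 0.5 * mahalanobis_sq(y, (mu[0] + s[0], mu[1] + s[1]), cov)
        for w, s in zip(weights, shifts)
    ]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    unnormalised = [math.exp(value - top) for value in logs]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def classify(y, mu, cov, shifts, weights):
    """Assign y to the component with the largest posterior probability."""
    tau = posteriors(y, mu, cov, shifts, weights)
    return max(range(len(tau)), key=tau.__getitem__), tau


L = math.log(1000)
shifts = [(0.0, 0.0), (0.0, L), (L, 0.0), (L, L)]
identity = ((1.0, 0.0), (0.0, 1.0))
label, tau = classify((0.1, L + 0.2), (0.0, 0.0), identity, shifts, [0.5, 0.1, 0.1, 0.3])
```

In this example the observation sits close to the centre of the second component (error in the second variable only), so `label` is 1 even though that component has a small mixing weight.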

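Since, in place of the parameters $(\theta, \pi)$, which are generally unknown, we use the maximum likelihood estimates $(\hat{\theta}, \hat{\pi})$, the classification rule becomes:

$$\tau_t(y_i; \hat{\theta}, \hat{\pi}) > \tau_g(y_i; \hat{\theta}, \hat{\pi}), \quad g = 1, \ldots, h; \; g \neq t. \qquad (2)$$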


We assume that $f_g(y; \theta_g)$ is a multivariate normal density $MN(\mu_g, \Sigma)$ and that each function $\phi_g(\cdot)$ acts on the mean vector $\mu$ as a translation: $\phi_g(\mu) = \mu + C_g$, where $C_g$ represents the translation vector for the mean of the g-th cluster and is supposed to be known. This setting, as already noticed, is suitable for dealing with unity measure error. In order to compute the maximum likelihood estimates, we use the EM algorithm as suggested in McLachlan and Basford (1988). Nevertheless, an additional effort is necessary to adapt the algorithm to our particular situation, where the mean vectors of the mixture components are linked by a known functional relationship. Thus, while in the non-constrained case (McLachlan and Basford 1988) a different mean vector has to be estimated for each mixture component, in our constrained situation only one mean vector needs to be estimated. The resulting modified EM algorithm consists of defining some initial guess $\hat{\pi}_1^{(0)}, \ldots, \hat{\pi}_h^{(0)}, \hat{\mu}^{(0)}, \hat{\Sigma}^{(0)}$ for the parameters to be estimated and applying the following recursive scheme until convergence:
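i) compute the posterior probabilities $\hat{\tau}_{gi}^{(k)} = \tau_g(y_i; \hat{\theta}^{(k)}, \hat{\pi}^{(k)})$ under the current estimates $\hat{\pi}^{(k)}, \hat{\mu}^{(k)}, \hat{\Sigma}^{(k)}$ (k is the index referring to the k-th cycle);

ii) calculate the new estimates by the following recursive equations:

$$\hat{\pi}_g^{(k+1)} = \sum_{i=1}^{n} \hat{\tau}_{gi}^{(k)} / n, \quad g = 1, \ldots, h;$$

$$\hat{\mu}^{(k+1)} = \sum_{i=1}^{n} \sum_{g=1}^{h} \hat{\tau}_{gi}^{(k)} y_i / n - \sum_{g=1}^{h} C_g \hat{\pi}_g^{(k+1)};$$

$$\hat{\Sigma}^{(k+1)} = \sum_{i=1}^{n} \sum_{g=1}^{h} \hat{\tau}_{gi}^{(k)} (y_i - \hat{\mu}_g^{(k+1)})(y_i - \hat{\mu}_g^{(k+1)})' / n.$$

We remark that $\hat{\mu}_g^{(k)}$ stands for $\hat{\mu}^{(k)} + C_g$.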
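
The recursive scheme can be sketched in a few lines for the simplest possible case. The code below is a simplified illustration, not the authors' R implementation: it treats a single variable (q = 1), so the covariance matrix reduces to a variance, it uses a crude deterministic starting point rather than the initialisation strategy discussed next, and it runs a fixed number of cycles instead of testing convergence. The function name and the simulated data are hypothetical.

```python
import math
import random


def constrained_em(y, shifts, n_iter=50):
    """EM for a 1-D Gaussian mixture with means mu + shifts[g] and a common variance."""
    n, h = len(y), len(shifts)
    weights = [1.0 / h] * h
    mu = min(y)  # crude start: the error-free cluster is the lowest one
    mean_y = sum(y) / n
    var = sum((v - mean_y) ** 2 for v in y) / n
    for _ in range(n_iter):
        # Step i: posterior probabilities under the current estimates.
        tau = []
        for v in y:
            logs = [
                math.log(max(w, 1e-300)) - 0.5 * (v - mu - c) ** 2 / var
                for w, c in zip(weights, shifts)
            ]
            top = max(logs)
            unnormalised = [math.exp(value - top) for value in logs]
            total = sum(unnormalised)
            tau.append([u / total for u in unnormalised])
        # Step ii: update mixing weights, the single mean and the common variance.
        weights = [sum(t[g] for t in tau) / n for g in range(h)]
        mu = sum(t[g] * (v - shifts[g]) for v, t in zip(y, tau) for g in range(h)) / n
        var = sum(
            t[g] * (v - mu - shifts[g]) ** 2 for v, t in zip(y, tau) for g in range(h)
        ) / n
    return weights, mu, var


# Simulated log-scale data: 30% of the units carry a factor-1,000 error.
rng = random.Random(1)
shift = math.log(1000)
data = [rng.gauss(2.0, 0.5) + (shift if rng.random() < 0.3 else 0.0) for _ in range(2000)]
weights, mu, var = constrained_em(data, [0.0, shift])
```

Only one mean is updated in step ii, because every component mean is that mean plus a known shift; this is the difference from the unconstrained EM, which would update one mean per component.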

In practical applications, it turns out that a crucial role is played by the choice of starting points, as is usual for EM algorithms (see Biernacki, Celeux and Govaert 2003). To overcome this problem, we use an initialisation strategy, following Biernacki et al. (2003), consisting of several short runs of the algorithm (short in terms of the number of iterations) from random initialisations, followed by a long run of EM from the solution maximising the observed log-likelihood.

It is worth mentioning that, due to the location constraints, the parameters to be estimated are considerably fewer than those in a usual mixture problem. In fact, the higher the number of variables analysed, the bigger this difference; for instance, in the case of three variables and 8 clusters we need to estimate 16 parameters instead of 37. This aspect is particularly important when we deal with small samples. Moreover, constraints on cluster locations make it easier to identify “rare clusters”. In fact, since the relative distances between mean vectors are fixed, the estimation problem reduces to estimating the location of the convex polyhedron whose vertices are the cluster centroids. In other words, since the location of one centroid uniquely determines the positions of all the others, the parameters of small clusters are more easily estimated than if they were not constrained.
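
The parameter counts quoted above can be checked with a short calculation. The sketch assumes that both the constrained and the unconstrained mixture use a single common covariance matrix, which is the assumption under which the figures 16 and 37 are obtained; the function name is hypothetical.

```python
def n_parameters(q, h, constrained):
    """Free parameters of a Gaussian mixture with h components and a common covariance."""
    mixing = h - 1                     # the mixing weights sum to one
    means = q if constrained else h * q  # one mean vector, or one per component
    covariance = q * (q + 1) // 2      # symmetric q x q matrix
    return mixing + means + covariance


constrained_count = n_parameters(3, 8, constrained=True)     # 7 + 3 + 6
unconstrained_count = n_parameters(3, 8, constrained=False)  # 7 + 24 + 6
```

The gap between the two counts is (h - 1)q, which grows with the number of variables both directly and through h = 2^q.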

Since the modelling introduced is based on the assumption that observations are normally distributed, model validation is an issue to take into account. The problem of assessing normality in mixture models is well described in McLachlan and Basford (1988). It is essentially based on the quantities $\hat{a}_{gj}$, described in the following. Let $y_{gj}$, for $j = 1, \ldots, n_g$, be the observations assigned to the g-th cluster, for $g = 1, \ldots, h$, according to the estimated model. Let $p_{gj}$ be the value calculated using the estimated parameters, following the formula:

$$p_{gj} = \frac{(\nu n_g / q) \, D(y_{gj}, \hat{\mu}_g; \hat{\Sigma})}{(\nu + q)(n_g - 1) - n_g D(y_{gj}, \hat{\mu}_g; \hat{\Sigma})}, \qquad (3)$$

where $D(\cdot, \cdot; M)$ is the Mahalanobis squared distance based on the metric M, and $\nu = n - h - q$. We define $\hat{a}_{gj}$ as the area to the right of the $p_{gj}$ value under the $F_{q,\nu}$ distribution (for details see McLachlan and Basford 1988, Chapter 2).

Under the normality assumption, $\hat{a}_{gj}$, for $j = 1, \ldots, n_g$, is approximately uniformly distributed on (0, 1). Hawkins (1981) suggests using the Anderson-Darling statistic for assessing the uniform distribution of the $\hat{a}_{gj}$. The $\hat{a}_{gj}$ are also useful to detect outliers, i.e., atypical observations with respect to the model. According to McLachlan and Basford (1988), the lower $\hat{a}_{gj}$ is, the higher is the probability that $y_{gj}$ is atypical; thus all observations with $\hat{a}_{gj} < \alpha$, where $\alpha$ is a specified threshold, can be considered atypical. Suggested threshold levels range from $\alpha = 0.05$ to $\alpha = 0.005$, depending on which outlying observations (more or less extreme values) are to be selected.
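
A minimal sketch of the atypicality index follows. It assumes that the statistic has the Hawkins (1981) form $p = (\nu n_g / q) D / \{(\nu + q)(n_g - 1) - n_g D\}$ with $\nu = n - h - q$, and it is restricted to the bivariate case (q = 2), where the upper tail of the $F_{2,\nu}$ distribution has the closed form $(1 + 2x/\nu)^{-\nu/2}$, so that no statistical library is needed. For other values of q an F distribution routine would be required. The function names and the example values are hypothetical.

```python
def hawkins_statistic(d_sq, n_g, n, h, q):
    """Statistic p_gj computed from a squared Mahalanobis distance d_sq (assumed form)."""
    nu = n - h - q
    return (nu * n_g / q) * d_sq / ((nu + q) * (n_g - 1) - n_g * d_sq)


def atypicality_bivariate(p_stat, n, h):
    """Upper-tail area of F(2, nu) at p_stat; valid only for q = 2."""
    nu = n - h - 2
    return (1.0 + 2.0 * p_stat / nu) ** (-nu / 2.0)


# Hypothetical unit: squared distance 2.0 from the centre of a cluster of 500 units,
# with n = 1,000 observations, h = 4 clusters and q = 2 variables.
p_stat = hawkins_statistic(2.0, n_g=500, n=1000, h=4, q=2)
a_hat = atypicality_bivariate(p_stat, n=1000, h=4)
is_atypical = a_hat < 0.005
```

A unit this close to its cluster centre is not flagged; the index falls below the threshold only for squared distances well into the tail.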


3. Diagnostics for Selective Editing 


Once the parameters of the mixture have been estimated, 
we are able to classify data into the different clusters, i.e., 
for each observation we can assess whether it is in error or 
not, and which variables are in error. However, different 
types of critical observations can be identified after the 
modelling phase: units classified in a cluster, but having a 
non-negligible probability of belonging to another cluster, 
and observations that are outliers with respect to the model. 

In order to increase data accuracy it would be useful to 
make a double check on critical observations (through either 
a clerical review or, in the most difficult cases, a follow-up). 
On the other hand, in order to reduce possible over-editing 
and editing costs, the manual review and/or follow up 
should be concentrated on the most critical observations. 
The proposed mixture model directly provides diagnostics 
that can be used to this aim. 

A first type of critical unit is represented by possibly misclassified observations. In order to measure the degree of belief in the class assigned to an observation $y_i$, we can consider the corresponding probability resulting from (2). Observations for which this probability is not very close to one have a non-negligible probability of belonging to another cluster. These observations lie in the region where the mixture components overlap.

In addition to the previous type of critical units, there are other observations that are far from all the clusters (all the mixture components), i.e., outliers with respect to the model. These observations also represent critical situations. In order to identify this kind of outlier we refer to the quantities $\hat{a}_{gj}$ described in the previous section.

The classification probability and the atypicality index $\hat{a}_{gj}$ should be used, according to a selective/significance editing approach (Latouche and Berthelot 1992; Lawrence and McKenzie 2000), to build up appropriate score functions to prioritise critical units. An example of how to use these diagnostics to this aim is given in subsection 4.2.


4. Illustrative Examples 


In this section some experiments carried out in order to 
investigate the peculiarities of the proposed method are 
presented. Firstly, through a simulation study, we analyse 
the performance of the proposed model when applied to 
data that depart from normality. Secondly, through an 
application on real data, we describe how this approach can 
be applied in Official Statistics. 

All the experiments are performed using the R environ- 
ment for statistical computing (http://www.r-project.org/). 


4.1 Simulated Example: Departure from Normality 


In this experiment we describe the results obtained by applying the mixture approach to the three different populations depicted in the first row of Figure 2. The first distribution is a bivariate normal (MN), hence it represents the case when the model is correctly specified. The second one corresponds to a bivariate t distribution (MT), i.e., it mimics the situation in which the departure from normality consists essentially in heavier tails. The last one is a bivariate skew-t distribution (ST) (Azzalini and Capitanio 2003; Azzalini, Dal Cappello and Kotz 2003), and it represents a population distributed according to an asymmetric distribution with heavy tails.

From these distributions we build a four-component mixture model by adding to each unit one of the four translation vectors $C_1 = (0, 0)$, $C_2 = (0, \log(1,000))$, $C_3 = (\log(1,000), 0)$, $C_4 = (\log(1,000), \log(1,000))$ with probabilities $\pi_1 = 0.5$, $\pi_2 = 0.1$, $\pi_3 = 0.1$, and $\pi_4 = 0.3$ respectively. These parameters represent the mixing proportions of the mixture model and refer respectively to the probabilities of no translation in the variables, translation in only one of the two variables, and translation in both variables. From each mixture, we draw 100 samples of 1,000 observations. In the second row of Figure 2, we report one of these samples for each mixture (MN-Mixt, MT-Mixt, ST-Mixt), corresponding to the three different populations MN, MT, ST respectively.
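
The construction of a contaminated sample can be sketched as follows. This is an illustration only: the clean population is drawn from two independent standard normals as a stand-in for MN (the t and skew-t populations are not reproduced), and the function and variable names are hypothetical.

```python
import math
import random

SHIFT = math.log(1000)
SHIFTS = [(0.0, 0.0), (0.0, SHIFT), (SHIFT, 0.0), (SHIFT, SHIFT)]
PROBS = [0.5, 0.1, 0.1, 0.3]


def contaminate(points, rng):
    """Add to each unit one of the four translation vectors, drawn with probabilities PROBS."""
    labels = rng.choices(range(4), weights=PROBS, k=len(points))
    shifted = [(x + SHIFTS[g][0], y + SHIFTS[g][1]) for (x, y), g in zip(points, labels)]
    return shifted, labels


rng = random.Random(0)
clean = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
sample, labels = contaminate(clean, rng)
```

Keeping `labels` makes it possible to count correct classifications once a model has been fitted to `sample`.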

For each sample, we compute the number of correct 
classifications obtained by using the mixture approach 
described in section 2. The mean number of correct 
classifications over the 100 samples is reported in Table 1. 

As can be seen in Table 1, the frequency of correct classifications decreases with the departure from normality. However, it seems acceptable even in the critical case ST, where the population is characterised by both asymmetry and heavy tails.

Figure 2. Contour plots of the three bivariate distributions multinormal (MN), Student t (MT), skew-t (ST), and scatter plots of the corresponding mixtures MN-Mixt, MT-Mixt, ST-Mixt.

Table 1
Frequency of Correct Classifications

                          MN      MT      ST
% correctly classified    98.5    97.5    95.6


As discussed in section 3, the mixture approach provides 
elements (such as the degree of atypicality and the 
classification probability) that can be used in order to 
prioritise units to be clerically reviewed. Therefore, an 
overall assessment of the procedure should consider also the 
results obtained through a selective editing approach based 
on these model diagnostics. 

In order to analyse the characteristics of the atypicality index and the classification probability, we examine a single sample of 1,000 observations drawn from each of the three populations introduced so far. In Figure 3, the three samples MN-Mixt(a), MT-Mixt(a), ST-Mixt(a) are represented; the misclassified units are depicted with a cross in the same graph. The number of misclassified units is 19 for MN-Mixt, 20 for MT-Mixt, and 36 for ST-Mixt.

On these samples, we focus on the impact of different threshold levels both for atypicality ($\alpha$) and classification probability ($\beta$). For each threshold, we report in Table 2 and Table 3 the number of units below that threshold, i.e., the number of critical observations (N. Atyp, N. Pr. Class), and among them the number of misclassified units (Atyp - Misclas, Pr. Class - Misclas).

As far as atypicality is concerned, we note that when the model is correctly specified, the importance of the atypicality index in recovering misclassified units is negligible, while the classification probabilities are more effective. On the other hand, the degree of atypicality is important when the model departs from normality. Since a unit may fall below both thresholds, the number of observations selected for a given combination of thresholds $\alpha$ and $\beta$ is not in general equal to the sum of the frequencies reported in Table 2 and Table 3. Thus, in order to evaluate the joint impact of these two indices, we choose the thresholds $\alpha = 0.005$ and $\beta = 0.975$. We report in Figure 3 (second row) the units selected only for the atypicality value (squares), only for the classification probability (triangles), and for both of them (crosses). From these figures we see that the impact of atypicality is mainly on outlier identification, while the classification probability works on the overlapping regions. Table 4 shows the number of selected units and, among them, the number of misclassified units.
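
The joint use of the two thresholds can be sketched as a simple selection rule. The sketch assumes that each unit comes with an atypicality index and the posterior probability of its assigned class; the function name and the three example units are hypothetical.

```python
def select_critical(units, alpha=0.005, beta=0.975):
    """Return the units flagged for review, with the reason(s) for each flag."""
    selected = {}
    for unit_id, atypicality, class_probability in units:
        reasons = []
        if atypicality < alpha:
            reasons.append("atypical")
        if class_probability < beta:
            reasons.append("uncertain class")
        if reasons:
            selected[unit_id] = reasons
    return selected


# (unit id, atypicality index, probability of the assigned class)
units = [
    ("u1", 0.400, 0.999),  # neither criterion: not selected
    ("u2", 0.001, 0.999),  # outlier with respect to the model
    ("u3", 0.300, 0.600),  # lies where two components overlap
    ("u4", 0.002, 0.800),  # both criteria
]
critical = select_critical(units)
```

A unit such as `u4` is counted once even though it satisfies both criteria, which is why the number of selected units can be smaller than the sum of the two separate counts.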

We note that for MN-Mixt, apart from one observation, all the misclassified units are selected. For MT-Mixt, we are able to select 14 out of the 20 misclassified units, and in the most critical sample, ST-Mixt, we select 24 out of the 36 misclassified units.

Figure 3. Misclassified units (crosses) in MN-Mixt(a), MT-Mixt(a), ST-Mixt(a). Critical units for atypicality (square), for classification probability (triangle), and for both of them (cross), in MN-Mixt(b), MT-Mixt(b), ST-Mixt(b).

Table 2
Number of Critical Observations and Misclassified Units with Respect to Three Different Thresholds for Atypicality

          MN-Mixt                       MT-Mixt                       ST-Mixt
α         N. Atyp   Atyp - Misclas      N. Atyp   Atyp - Misclas      N. Atyp   Atyp - Misclas
0.05      50        1                   84        9                   68        14
0.01      15        0                   50        7                   bie,      8
0.005     8         0                   39        i                   20        5
0.001     4         0                   25        4                   14        Z

Table 3
Number of Critical Observations and Misclassified Units with Respect to Three Different Thresholds for Classification Probability

          MN-Mixt                                MT-Mixt                                ST-Mixt
β         N. Pr. Class   Pr. Class - Misclas     N. Pr. Class   Pr. Class - Misclas     N. Pr. Class   Pr. Class - Misclas
0.99      Ly             she                     63             12                      182            26
0.975     76             18                      46             11                      82             26
0.95      a)             14                      35             9                       66             21

Table 4
Number of Critical Observations and Misclassified Units with Respect to Atypicality and Classification Probability

                         MN-Mixt                        MT-Mixt                        ST-Mixt
Thresholds               N. Crit. Units   N. Misclas    N. Crit. Units   N. Misclas    N. Crit. Units   N. Misclas
α = 0.005, β = 0.975     84               18            79               14            98               24


4.2 An Application to Real Data: The 1999 Italian 
Water Survey System 


In this section we describe an application of the mixture 
model approach to real survey data. The data are taken from 
the 1999 Italian Water Survey System (WSS). The WSS is a 
census that collects information on water abstraction, supply 
and usage for the 8,100 Italian municipalities. We restrict 
our analysis to the municipalities belonging to one of the 
data domains defined by altimetry (2,041 observations) and 
to the main variables Total Invoiced Water (TI) and Total 
Supplied Water (TS). Both these variables refer to water 
volumes and the respondents are requested to provide them 
in thousands of cubic meters. The scatter plot on log-scale 
of per capita water invoiced (WI) versus per capita water 
supplied (WS) (Figure 4) shows the presence of four clusters 
corresponding to unity measure error in one, both, or none 
of the target variables. This is probably due to the 
misunderstanding of some respondents that expressed water 
volumes in litres or in cubic meters rather than thousands of 
cubic meters, as requested. As expected, the two most 
populated clusters are those corresponding to non-erroneous 
units and to units where both variables are in error. 
Nevertheless, we can note the presence of two rare clusters 
corresponding to observations where the unity measure 
error affects only TI or only TS, respectively.

In Table 5 a label is assigned to each group associated
with a specific error pattern. For the sake of simplicity we
introduce two flags, $E_{TS}$ and $E_{TI}$, taking value 1 or 0
depending on whether the corresponding variable is
affected by the unity measure error or not, respectively.

In order to identify and correct the unity measure error 
we apply the procedure described in sections 2 and 3. We 
classify each observation according to a specific error 
pattern, i.e., we assign each unit to one of the clusters $G_t$,
for $t = 1, \ldots, 4$. The results are reported in Table 6.

For each unit the atypicality index is also calculated, and
the threshold $\alpha = 0.005$ is chosen in order to flag atypical
units. According to this threshold, 71 observations are
selected as atypical; they are marked by crosses in Figure 7. Once
the atypicality values are computed according to Formula (3), a test
of the normality assumption can be performed.
Following McLachlan and Basford (1988, Chapter
2), we apply the Anderson-Darling test of uniformity to the
atypicality values within each estimated cluster. The p-values are
below 0.001 for the two largest clusters. Since the test is
based on asymptotic approximations, we do not consider
the results for the other two rare populations. In
Figure 5 we plot the empirical sample quantiles against the
normal quantiles of the variables log(WI) and log(WS),
focusing only on the subset of data classified as
non-erroneous. The departure from normality is
mainly due to heavy tails. Since the method also performed
satisfactorily in the non-Gaussian settings of section 4.1,
we are confident about the good
performance of the mixture approach on the survey data.
This expected behaviour is confirmed by the application
results shown below.
Figure 4. Scatter plot of log(WS) and log(WI).

Table 5
Error Patterns and Error Labels
Error pattern    $E_{TS} = 0$    $E_{TS} = 0$    $E_{TS} = 1$    $E_{TS} = 1$
                 $E_{TI} = 0$    $E_{TI} = 1$    $E_{TI} = 0$    $E_{TI} = 1$
Cluster label    G1              G2              G3              G4


Figure 5. Normal qq-plot of log(WS) and log(WI).

Table 6
Number of Units Assigned to Each Cluster
Cluster label    G1       G2     G3     G4
N. of units      1,800    16     10     215
%                88.2     0.8    0.5    10.5


In the remaining part of this section, we show how the
posterior probabilities can be used to prioritise for review the
units that are likely to provide the greatest editing
benefit, taking into account the potential impact of
clerical editing on the estimates. To this aim, note that when
an observation is wrongly classified, the final
value of at least one variable differs from the corresponding
true value by a multiplicative factor. These discrepancies
can seriously affect the accuracy of the estimates, leading to
a strong bias. In order to select the potentially erroneous
units that most likely have a strong impact on the target
estimates, we follow the selective editing approach. Let
$X_1$, $X_2$ denote the variables TS and TI, respectively. For each
unit $u_i$, $i = 1, \ldots, n$, and for each variable $X_j$, $j = 1, 2$, let
us define:
$X_{ij}$: data free of systematic error;
$Y_{ij}$: observed data;
$\tilde{X}_{ij}$: data after the treatment of systematic error based
on the classification through the mixture model (i.e.,
$\tilde{X}_{ij} = Y_{ij}$ or $\tilde{X}_{ij} = Y_{ij}/1000$, depending on the
cluster the unit $u_i$ is assigned to).

Let us suppose that the target estimates refer to the population
totals $T(X_j) = \sum_i X_{ij}$. Further, denote by $E(\cdot)$ the
expectation over the distribution of the random variable $X_{ij}$,
conditional on the observed data $Y_{ij}$ and the data after
correction $\tilde{X}_{ij}$. Then, from the inequality

$$\Big| \sum_i E(X_{ij} - \tilde{X}_{ij}) \Big| \le \sum_i E\,|X_{ij} - \tilde{X}_{ij}|$$

it follows that the
quantity on the right hand side can be viewed as an upper
bound for the expected bias of the total estimate for the
variable $X_j$ based on the corrected values $\tilde{X}_{ij}$. This
suggests a method for selecting the most
“influential” units with respect to the estimate $T(X_j)$: in
order to guarantee the requested level of accuracy and to
minimise the costs of manual checking, we define a local score
function $S_{ij} = E\,|X_{ij} - \tilde{X}_{ij}| / \hat{T}(X_j)$, where $\hat{T}(X_j)$ is
a reference estimate for $T(X_j)$, for instance the estimate
from a previous survey, or a robust estimate. In our case, in
order to robustify the preliminary estimate we first exclude
the atypical observations from the data, then compute the
mean value on this subset, and finally multiply it by the total
number of units.

The local score $S_{ij}$ measures the impact of the potential
unity measure error associated with the unit $u_i$ on the target
estimate $T(X_j)$. Units can then be sorted by their score
$S_{ij}$ and, starting from the highest values, selected
until the sum of the remaining $S_{ij}$ values is
lower than a predefined threshold.

If both the variables TS and TI are considered
simultaneously, a global score $S_i$, for $i = 1, \ldots, n$, can be
obtained by suitably combining the local score functions
$S_{ij}$, $j = 1, 2$. Possible choices are $S_i = (S_{i1} + S_{i2})/2$ or
$S_i = \max_{j=1,2} S_{ij}$. The latter function, for instance, ensures
that the impact of the potential unity measure error
associated with $u_i$ on each estimate is not greater than $S_i$.

In order to compute the scores $S_{ij}$, the conditional
expected value $E\,|X_{ij} - \tilde{X}_{ij}|$ is to be computed for each
unit $u_i$, $i = 1, \ldots, n$, and for each variable $X_j$, $j = 1, 2$.
This can be easily done through the posterior probabilities.
For instance, suppose that the unit $u_i$ has been assigned to
the cluster $G_2$. This means that, for this unit, the observed
value of TS ($Y_{i1}$) has been considered correct, while the
observed value of TI ($Y_{i2}$) has been flagged as affected by
unity measure error (i.e., multiplied by 1,000). The
correction consists of dividing the observed value
of TI by 1,000, i.e., $(\tilde{X}_{i1} = Y_{i1},\ \tilde{X}_{i2} = Y_{i2}/1000)$. The conditional
expected values $E\,|X_{ij} - \tilde{X}_{ij}|$ can be computed as follows:

$$E\,|X_{i1} - \tilde{X}_{i1}| = |Y_{i1} - Y_{i1}| \Pr(u_i \in G_1 \cup G_2) + \left| \frac{Y_{i1}}{1000} - Y_{i1} \right| \Pr(u_i \in G_3 \cup G_4) = \frac{999}{1000}\, Y_{i1} (\tau_{i3} + \tau_{i4}),$$

$$E\,|X_{i2} - \tilde{X}_{i2}| = \left| Y_{i2} - \frac{Y_{i2}}{1000} \right| \Pr(u_i \in G_1 \cup G_3) + \left| \frac{Y_{i2}}{1000} - \frac{Y_{i2}}{1000} \right| \Pr(u_i \in G_2 \cup G_4) = \frac{999}{1000}\, Y_{i2} (\tau_{i1} + \tau_{i3}),$$

where $\tau_{it}$ is the estimated probability that unit $u_i$ belongs
to cluster $G_t$. In a similar manner the score functions can
be calculated for all the units.
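
As a hedged illustration of the computation above, the following Python sketch evaluates the conditional expected absolute error for both variables of a single unit, whatever cluster the unit was assigned to. The function name, the array layout and the use of NumPy are assumptions made for illustration only; the mapping between clusters and error flags follows Table 5, the factor 1,000 is the one discussed in the text, and observed values are assumed to be non-negative.

```python
import numpy as np

# Error flags (E_TS, E_TI) for clusters G1..G4, following Table 5.
ERROR_FLAGS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
FACTOR = 1000.0  # size of the unity measure error considered in the text


def expected_abs_error(y, tau, assigned):
    """Expected absolute error of the corrected values of one unit.

    y        : observed values (TS, TI), assumed non-negative
    tau      : estimated posterior probabilities for G1..G4
    assigned : 0-based index of the cluster the unit was assigned to
    Returns an array with one value per variable (TS, TI).
    """
    y = np.asarray(y, dtype=float)
    tau = np.asarray(tau, dtype=float)
    # Error-free values implied by each of the four clusters, shape (4, 2).
    candidates = y / FACTOR ** ERROR_FLAGS
    # Corrected values actually retained for this unit.
    corrected = candidates[assigned]
    # Average the absolute discrepancies over the posterior probabilities.
    return tau @ np.abs(candidates - corrected)
```

For a unit assigned to G2 with observed values (500, 2000) and posterior probabilities (0.1, 0.7, 0.05, 0.15), the sketch returns approximately (99.9, 299.7), which agrees with the two closed-form expressions above. Dividing each component by the corresponding reference total gives the local scores.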

In practice, in our application we sort the units by their
global score $S_i = \max_{j=1,2} S_{ij}$ (ascending order). Then we
exclude from clerical review the first observations
whose cumulative sum of $S_i$ is below $\delta$, where $\delta$ is a
specified tolerance level for the impact on the estimates of
the errors remaining in the data. Figure 6 shows the
cumulative sum of $S_i$, $S_{(i)} = \sum_{k \le i} S_k$, for the
10 most critical observations. For the sake of
clarity we do not report all the observations: for
most of them $S_{(i)}$ is close to zero, and the difference in
magnitude would make the picture unreadable. Note that a residual
relative error lower than $\delta = 0.001$ is expected by selecting
only the first two units (drawn with crosses).
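
The selection rule just described can be sketched as follows. This is a minimal illustration and not the procedure actually used in the application: the function name and the toy scores are hypothetical, and the global scores are assumed to be non-negative and already computed.

```python
import numpy as np


def select_for_review(scores, delta):
    """Return the indices of the units to review, most critical first.

    scores : global scores S_i, assumed non-negative
    delta  : tolerance on the cumulative score of the units left unreviewed
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                 # ascending order of S_i
    cumulative = np.cumsum(scores[order])      # cumulative score S_(i)
    # Number of leading units whose cumulative score stays below delta.
    n_excluded = np.searchsorted(cumulative, delta, side="left")
    return order[n_excluded:][::-1]
```

With the toy scores (0.0001, 0.02, 0.0003, 0.0, 0.01, 0.0004) and a tolerance of 0.001, the four smallest scores sum to about 0.0008 and are excluded, so the function selects the units at positions 1 and 4.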

In Figure 7 all the units selected because of their
atypicality (71) and/or the relative impact of their
potential errors on the estimates (2) are shown: crosses correspond to
observations that are critical for atypicality, squares to
units that are critical for the impact of their potential error.

We now compare these results with those obtained by the official
procedure. Out of the 1,968 units not selected for
clerical review, 1,911 observations are error free or affected
by unity measure error only. For all of them the
classification of the mixture model is correct. Out of the
remaining 57 units, characterised by other error typologies,
45 are classified as not affected by the unity measure error,
and 12 as units with the 1,000 error in both variables.
This last misclassification can be explained by the presence
of other systematic errors (factors of 100 or 10,000) that
are not taken into account in the model used for this example.
Figure 6. Plot of the cumulative score $S_{(i)}$ for the
10 most critical observations.

Figure 7. Scatter plot of log(WS) vs log(WI). Crosses indicate
critical units for atypicality, squares mark critical units
for the impact of their potential error.


A further comparison concerns the estimates of the totals.
Under the hypothesis that the values selected for
clerical review are correctly restored, the relative differences
between the “true” totals according to the official
procedure, $T(X_j)$, and the model estimates, $\tilde{T}(X_j)$, defined as
$B(X_j) = |\tilde{T}(X_j) - T(X_j)| / T(X_j)$ for $j = 1, 2$, are
$B(X_1) = 0.005$ and $B(X_2) = 0.002$. These values are not
directly comparable with the tolerance level $\delta = 0.001$:
this threshold relates only to the impact of the remaining
unity measure errors, while $B(X_j)$ is also affected by other
kinds of errors. Thus, for a more direct comparison, we
replace the wrong values of the units affected by these other errors
with the “true” ones, obtaining $B(X_1) = B(X_2) = 0$. This particularly high
performance of the model is explained by the low degree of
overlap between the clusters, as is clear in Figure 7.


5. Final Remarks and Further Research 


In this paper we propose a finite mixture model to deal
with a particular type of systematic error that frequently
affects numerical continuous survey data: the unity measure
error times a constant factor. With respect to traditional
approaches, the proposed approach has the advantages of formally
stating the problem in a multivariate context, of being easily
implemented in generalised software, and of naturally
providing useful diagnostics for prioritising doubtful units
possibly containing influential errors. The latter characteristic
is particularly important when the situation is critical,
i.e., when different error patterns overlap or, in
other words, when unity measure errors lie among plausible
observations. In these circumstances a clerical review is
needed. Hence, it is important to optimise the selection of
critical observations in order to save time and costs. All
these advantages are the natural consequence of the
introduction of a model-based technique. On the other hand,
a model-based approach clearly raises problems
related to model assumptions. However, the
experiments illustrated in the paper suggest that the
proposed technique performs satisfactorily even when the
normality assumption does not hold. Nevertheless, it
is worth mentioning that for extreme departures from normality,
e.g., when the distribution is not unimodal, the method
is expected to fail. This can happen in real situations when
the true data contain different clusters; for instance, differences
between men’s and women’s incomes might cause a bimodal
income distribution. In some cases the problem
could be overcome by stratifying the data with respect to some
explanatory variables, e.g., sex in the previous example. An
alternative approach to this specific problem could be based
on modelling each cluster in turn as a Gaussian mixture,
thus obtaining a “mixture of mixture models” (McLachlan
and Peel 2000; Di Zio, Guarnera and Rocci 2004).

Finally, a last concern is the number of variables
that can be treated simultaneously. The number of
clusters, and hence the number of mixing parameters $\pi_t$, can
grow exponentially with the number of
variables, making parameter estimation a critical task.
However, it is worth noting that the number of
parameters related to the mean vector and covariance matrix
increases much more slowly, due to the constraints characterising
our model.
Using Matched Substitutes to Improve Imputations for Geographically
Linked Databases


Wai Fung Chiu, Recai M. Yucel, Elaine Zanutto and Alan M. Zaslavsky


Abstract 


When administrative records are geographically linked to census block groups, local-area characteristics from the census 
can be used as contextual variables, which may be useful supplements to variables that are not directly observable from the 
administrative records. Often databases contain records that have insufficient address information to permit geographical 
links with census block groups; the contextual variables for these records are therefore unobserved. We propose a new 
method that uses information from “matched cases” and multivariate regression models to create multiple imputations for 
the unobserved variables. Our method outperformed alternative methods in simulation evaluations using census data, and 
was applied to the dataset for a study on treatment patterns for colorectal cancer patients. 


Key Words: Unit nonresponse; Multiple imputation; Contextual variables; Matched substitutes; Administrative records.


1. Introduction 


In a study on treatment patterns for colorectal cancer 
patients, income and education are desired variables for 
constructing statistical models of relevant scientific interest. 
Unfortunately, individual measurements for these variables 
are not directly observable from the cancer registry 
databases that are compiled from hospital records, which 
like many administrative databases contain primarily 
information required for administrative purposes. Instead, 
mean values of these variables for small geographical areas 
(census block groups or tracts) including the subject’s area 
of residence are used as regressors to estimate income and 
education effects. Analyses using such “contextual vari- 
ables” are common in epidemiological and health services 
research (Krieger, Williams and Moss 1997), and often
produce results broadly similar to those based on individual 
variables. If both individual and contextual variables were 
available, it might be possible to separate the effects of indi- 
vidual characteristics and contexts; in a purely contextual 
analysis, these effects are confounded. Nonetheless, associa- 
tions between contextual socioeconomic characteristics and 
quality of care would suggest an equity problem, regardless 
of whether such associations primarily reflect individual or 
community-level relationships. 

In the colorectal cancer treatment study, each contextual
variable for a given patient record is assumed to be the
variable’s census block group (or tract) mean value obtained by
geographically linking the record’s address to a census
block group (or tract). A small but substantial percentage of
patient records (about 3.3% or 1,696 records) have
insufficient address information to permit links with census 
block groups, hence making the corresponding contextual 
variables unobservable. Such records will be called 
ungeocodable records, while records that can be linked to 
census block groups will be referred to as geocodable. To 
generate multiple imputations for the unobserved contextual 
variables, we propose a strategy that uses information from 
more than one “matched case” to help build parametric/ 
nonparametric imputation models. In particular, information 
from the matched cases accounts for small area effects in 
our imputation models, so that there is no need to explicitly 
model such effects. 

Rubin and Zanutto (2001) use the term “matched 
substitute” instead of “matched case”, and propose a 
parametric imputation model using only one matched 
substitute per record. The analyses resulting from their
model were compared with those given by other analytic
methods in an extensive simulation study, but the method was not
applied to real data. We extend Rubin and Zanutto’s method
by (1) allowing use of information from more than one 
matched case per record and (2) using an empirical rather 
than a parametric distribution of residuals. 

This research was motivated by our need for multiple 
imputations for the partially observed variables in the study 
of treatment patterns for colorectal cancer patients. Ayanian, 
Zaslavsky, Fuchs, Guadagnoli, Creech, Cress, O’Connor, 
West, Allen, Wolf and Wright (2003) analyzed a dataset 
that included imputations generated by our method, 
referring to Rubin and Zanutto (2001) and a preliminary
version of this paper that appeared in a proceedings 
publication (Chiu, Yucel, Zanutto and Zaslavsky 2001). 
This paper is the first comprehensive publication of our 
methodology and the first published report that describes an 
application of Rubin and Zanutto’s method to real data. 

The organization for the rest of this paper is as follows. 
Section 2 summarizes Rubin and Zanutto’s method and 
gives a general description of our method. Section 3 outlines 
the application of our method to the colorectal cancer study. 
Section 4 illustrates in a simulation study the performance 
of our method relative to three other commonly-used 
nonresponse adjustment methods. 


2. Imputation Methodology 


This section will begin with a summary of Rubin and 
Zanutto’s method, followed by a general description of our 
method that includes a discussion on out-of-sample versus 
within-sample matching, the details of the modeling and 
multiply-imputing tasks, and an analysis of efficiency as a 
function of the number of matched cases used. 


2.1 Matching, Modeling and Multiply Imputing 


Rubin and Zanutto (2001) proposed a method called 
“matching, modeling, and multiply imputing” (MMM) that 
uses matched substitutes to help generate multiple impu- 
tations for nonrespondents in sample surveys, without 
requiring that substitutes be perfect replacements for the 
nonrespondents. Matched substitutes are responding survey 
units chosen to match the nonrespondents on one or more 
“matching covariates”, that is, variables that are available prior to
the survey and are convenient for matching but not
necessarily for modeling. As a result of matching, nonrespondents
and their substitutes may share similar values in their “field
covariates”, that is, variables that are only implicitly observed and
are therefore not available for data analysis. “Modeling
covariates” are variables that can be included in statistical 
models to adjust for observed differences between non- 
respondents and their substitutes, but that may not be 
available or used for matching. The essence of MMM is that 
both matching and modeling covariates are used, in the 
context of proper multiple imputation (Little and Rubin 
1987, pages 258 — 259 and references therein). 

Consider a simple example where age and address 
covariates are available for all units in a population prior to 
sampling. Finding substitutes matching nonrespondents 
with respect to both age and address may be difficult. An 
alternative is to match only on address (e.g., choosing a 
neighbor to be a substitute) and adjust for systematic age 
differences between nonrespondents and matched
substitutes through statistical modeling. If neighboring households
were chosen as matched substitutes for nonresponding
households, the substitutes and nonrespondents might have
similar socioeconomic contexts (e.g., levels of crime, access
to public transportation, etc.) even though these
characteristics might not have been recorded. In this example,
address is a matching covariate, age is a modeling covariate,
and the contextual socioeconomic characteristics are field
covariates.

In summary, MMM (i) chooses matched substitutes for
nonrespondents and some respondents based on matching 
covariates, (ii) uses modeling covariates to fit a model 
estimating the systematic differences in responses between 
pairs of respondents and substitutes, (iii) multiply-imputes 
the unobserved values using the model in (ii) under the 
assumption that the same relationship holds between pairs 
of nonrespondents and substitutes, and (iv) discards all 
matched substitutes after imputation. 


2.2 Out-of-Sample Versus Within-Sample Matching 


Matched cases may be obtained from out-of-sample data 
or within-sample data. In the Rubin and Zanutto approach, 
matched substitutes are obtained from out-of-sample data 
after the missingness is detected. Their description empha- 
sizes that the matched substitutes must be discarded after 
imputation since including such additional cases in infer- 
ences would modify the sample design by adding extra 
cases in the “blocks” that contain unobserved data. Matched 
cases are considered within-sample data if they are obtained 
from the database that is available before imputing or even 
finding out which records in the database have unobserved 
variables. As far as the overall inferential goals are con- 
cerned, these matched cases are not additional cases, but are 
part of the original data collection, and therefore will be 
included in scientific analyses. 

Assuming within-sample matching, we treat the
ungeocodable records as nonrespondents and the geocodable
records as respondents. For each ungeocodable record, a
given number of matched cases are randomly chosen from a
pool of geocodable records within the same small
geographical area (e.g., zip code, which is a postal delivery code
usually representing an area served by a single main US
post office). Similarly, the same number of matched cases
are also chosen for each of the randomly sampled
geocodable records (see Rubin and Zanutto (2001) for
recommendations on the size of such a sample relative to the total
number of ungeocodable records in a given dataset). If more
matched cases are needed than are available in the
same small area, the selection pool is extended to the
“nearest” geographical areas until the required number of
matched cases is reached.
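
The within-sample matching rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout and, in particular, the `nearest_areas` ordering are hypothetical, since the text does not specify how the “nearest” areas are determined.

```python
import random


def draw_matched_cases(area, pool_by_area, nearest_areas, k, rng, exclude=None):
    """Randomly draw up to k matched cases for a record located in `area`.

    pool_by_area  : dict mapping an area code to a list of geocodable record ids
    nearest_areas : dict mapping an area code to the other area codes, ordered
                    by increasing distance (hypothetical input)
    exclude       : id of the record being matched, so that a geocodable
                    record is never matched to itself
    """
    chosen = []
    # Start with the record's own area, then widen the pool area by area.
    for candidate_area in [area] + list(nearest_areas.get(area, [])):
        if len(chosen) >= k:
            break
        pool = [r for r in pool_by_area.get(candidate_area, []) if r != exclude]
        needed = k - len(chosen)
        chosen += rng.sample(pool, min(needed, len(pool)))
    return chosen
```

Passing an explicit `random.Random` instance as `rng` keeps the draws reproducible. If the extended pool is still too small, the sketch simply returns fewer than `k` cases.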
All matched cases in the colorectal cancer study came
from the same cancer database. In general, matched cases 
need not be drawn from the same population in which the 
nonrespondents and respondents originated. For example, 
matched cases for colorectal cancer records can be obtained 
from a general population of cancer patients, and a model 
can then be fitted to correct for systematic differences. Note 
that, with matched cases from a more similar population, 
stronger models can be built with more covariates. In our 
example, since we used other patients with the same cancer 
type, relationships to treatment process and outcome 
variables are likely to be consistent. 


2.3 Modeling and Multiply-imputing 


A simple example of our method is given here to convey 
the basic idea; in practice, more complex models may often 
be required. Suppose the following relationship holds in the 
population, 


$$y_{ik} = x_{ik}^T \beta + \delta_i + \epsilon_{ik}, \qquad (1)$$


where $i$ indexes small geographical area, $k$ indexes unit
within area, and $y_{ik}$ and $x_{ik}$ are respectively the response
and the characteristics of the $k$th unit in geographical area
$i$. This model includes a regression prediction $x_{ik}^T \beta$, a
small-area effect $\delta_i$, and a unit-specific residual $\epsilon_{ik}$. We
assume that $\epsilon_{ik}$ follows some distribution $F_\epsilon$ with mean
zero and variance $\sigma^2$. Note that this development
generalizes directly to multivariate $y_{ik}$.

We extend Rubin and Zanutto’s method to allow more 
than one match in the same small area, because having 
several matches in small areas is possible (often convenient 
and inexpensive) in census data or in large administrative 
datasets. Rubin and Zanutto’s assumption of a single match 
is appropriate to survey data collection that requires 
additional field work for each match. 

The regression coefficients in equation (1) are estimated
using any collection of observations with two or more
records per small area to fit the regression model in which
the $\delta_i$ are treated as fixed effects. With only two cases per
area, $\beta$ can instead be estimated from the within-area
regression


$$(y_{i1} - y_{i2}) = (x_{i1} - x_{i2})^T \beta + (\epsilon_{i1} - \epsilon_{i2}), \qquad (2)$$


where the small area effect drops out. The residuals from
this regression have a symmetrical distribution with
variance $2\sigma^2$.
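
As an illustration of the within-area regression in equation (2), the sketch below estimates the regression coefficients by ordinary least squares on the within-area differences, assuming exactly two records per area. The function name and array shapes are hypothetical, and the residual variance is halved because the differenced residuals have twice the variance of the unit-level residuals.

```python
import numpy as np


def within_area_beta(y, x):
    """Least-squares fit of equation (2) with two records per area.

    y : array of shape (n_areas, 2), responses of the two records
    x : array of shape (n_areas, 2, p), their characteristics
    Returns the coefficient estimate and an estimate of the unit-level
    residual variance.
    """
    dy = y[:, 0] - y[:, 1]            # the small-area effect cancels out
    dx = x[:, 0, :] - x[:, 1, :]
    beta_hat, *_ = np.linalg.lstsq(dx, dy, rcond=None)
    resid = dy - dx @ beta_hat
    # Differenced residuals have variance 2 * sigma^2, hence the factor 2.
    sigma2_hat = resid @ resid / (2 * (len(dy) - dx.shape[1]))
    return beta_hat, sigma2_hat
```

On noise-free simulated data with arbitrary area effects, the sketch recovers the coefficients exactly, which shows that the area effects play no role in the differenced fit.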

Assuming for the moment that we have a draw from the
posterior distribution of $\beta$, we carry out the rest of this
analysis conditional on that draw. Now suppose that we are
interested in imputing for a new unit (indexed as $k = 0$) in
area $i$, and that we have obtained $K_i \ge 1$ matched cases for
this unit. Denote the outcomes of these matched cases by
the vector $y_i = (y_{i1}, \ldots, y_{iK_i})^T$ and the corresponding
characteristics by the matrix $X_i = (x_{i1}, \ldots, x_{iK_i})^T$. With a
flat prior for $\delta_i$, the posterior distribution for $\delta_i \mid y_i, X_i, \beta$
has mean


$$\bar{y}_i - \bar{x}_i^T \beta \qquad (3)$$


and variance $\sigma^2 / K_i$, where $\bar{y}_i = \sum_{k=1}^{K_i} y_{ik} / K_i$ and
$\bar{x}_i = \sum_{k=1}^{K_i} x_{ik} / K_i$. Hence, the predictive distribution for
$y_{i0} \mid y_i, X_i, x_{i0}, \beta$ has mean


$$\bar{y}_i + (x_{i0} - \bar{x}_i)^T \beta \qquad (4)$$


and variance $(1 + 1/K_i)\sigma^2$, which is the sum of the
predictive variance under the model conditional on all
parameters and the posterior variance of $\delta_i$. These
statements assume that the mean of the residuals is a
sufficient statistic for $\delta_i$. This assumption is true for the
normal distribution (or natural observations of any
exponential family distribution); we assume it is at least
approximately true for $F_\epsilon$, so that we can base inferences
on that mean. Note that use of a flat prior leads to
overdispersed draws relative to what would be obtained
with a proper prior from a hierarchical model, but is much
simpler (especially in analyses with multivariate
outcomes).

An imputation for $y_{i0}$ can be generated by first drawing
$\sigma^2$ from its posterior distribution, second drawing $\beta$
conditional on the draw of $\sigma^2$, third computing the
predictive mean in equation (4) from the draw of $\beta$, and
finally adding a residual of variance $(1 + 1/K_i)\sigma^2$ to the
predictive mean. In simple surveys with $\beta$ estimated by
equation (2), the posterior distribution of $\beta$ (conditional on
$\sigma^2$ and the data) under a flat prior is approximately
$N(\hat{\beta}, (X^T X)^{-1} \sigma^2)$, where the $i$th row of $X$ is
$(x_{i1} - x_{i2})^T$. In more complex designs, the posterior
distribution of $\beta$ can be approximated using the point
estimate and sampling variance calculated under the
associated design.

The residual can be obtained through modeling or
sampling. Modeling involves estimating $\sigma^2$ using the
residual variance of equation (1) and drawing the residual
under univariate normality (see Rubin and Zanutto (2001)
for the special case where only one matched case was
obtained for each record) or some other parametric
distribution. We refer to such an approach as parametric MMM
(PMMM). An alternative is to randomly sample a
regression residual from any area $j$ whose residuals might
be regarded as exchangeable with those from area $i$ (Rubin
1987, pages 166-168). See also Lessler and Kalsbeek
(1992, section 8.2.2.4), Kalton and Kasprzyk (1986), and
Kalton (1983). Since the variance of such a residual is
$[(K_j - 1)/K_j]\sigma^2$, we multiply the randomly-sampled
residual by $\sqrt{[(K_i + 1)/K_i][K_j/(K_j - 1)]}$ to obtain the
correct predictive variance. We call this approach
nonparametric MMM (NpMMM).

In summary, our method consists of three basic steps:


1. Draw matched cases for the ungeocodable records
and for some randomly sampled geocodable records;


2. Use the sampled geocodable records and their
matched cases to fit equation (1), where the $\delta_i$ are
treated as fixed effects, and save the residuals;


3. Repeat the following $m$ times (usually 5 to 10):


(a) Draw $\sigma^2$ from its posterior distribution, then $\beta$
conditional on the draw of $\sigma^2$;


(b) For each ungeocodable record, treat the sum of
the vector of predictive means obtained from
equation (4) and a vector of residuals drawn
using either PMMM or NpMMM as a realization
of the unobserved vector of contextual variables.
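
Step 3(b) with nonparametric residuals can be sketched as follows for a single, univariate contextual variable. This is a simplified illustration and not the authors' implementation: the function name and arguments are hypothetical, the draw of the regression coefficients from step 3(a) is taken as given, and all saved residuals are assumed to come from areas fitted with the same number of records (`k_donor`, greater than one).

```python
import numpy as np


def npmmm_impute(x0, x_matched, y_matched, beta, donor_residuals, k_donor, rng):
    """One nonparametric imputation draw for an ungeocodable record.

    x0              : characteristics of the record, shape (p,)
    x_matched       : characteristics of its K matched cases, shape (K, p)
    y_matched       : outcomes of its K matched cases, shape (K,)
    beta            : a posterior draw of the regression coefficients, shape (p,)
    donor_residuals : residuals saved from the fixed-effects fit of equation (1)
    k_donor         : number of records per area in that fit (must exceed 1)
    """
    k = len(y_matched)
    # Predictive mean of equation (4).
    mean = y_matched.mean() + (x0 - x_matched.mean(axis=0)) @ beta
    # Rescale the sampled residual to the predictive variance (1 + 1/K) sigma^2.
    scale = np.sqrt((k + 1) / k * k_donor / (k_donor - 1))
    return mean + scale * rng.choice(donor_residuals)
```

Repeating the call with a fresh draw of the coefficients and variance in each of the $m$ rounds gives the multiple imputations. A multivariate version would sample a whole residual vector from one donor record, so that the correlations among the contextual variables are preserved.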


2.4 Efficiency 


The efficiency of an imputation is related to the number
of matched cases used. Let $V_K$ be the predictive variance of
an imputation model where $K$ matched cases per record are
used. For the model in section 2.3, $V_K = (1 + 1/K)\sigma^2$.
Define efficiency as


$$E_K = \frac{V_\infty}{V_K} = \frac{\sigma^2}{(1 + 1/K)\sigma^2} = \frac{K}{K + 1} \qquad (5)$$


for any positive integer $K$. Efficiency increases as the
number of matched cases per record increases; for example,
$E_2 = 0.67$, $E_4 = 0.8$, $E_{10} = 0.91$ and $E_{20} = 0.95$.
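
The efficiency defined in equation (5) depends only on the number of matched cases, so it is easy to tabulate. The helper below is a hypothetical illustration, not part of the original method.

```python
def efficiency(k):
    """Efficiency K / (K + 1) of an imputation based on k matched cases."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return k / (k + 1)


# Tabulate the efficiency for a few numbers of matched cases per record.
for k in (1, 2, 4, 10):
    print(k, round(efficiency(k), 2))
```

Under this definition, going from one to two matched cases raises the efficiency from 0.5 to about 0.67, while each further case adds progressively less.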

Theoretically each record can have as many matched 
cases as permitted by available resources. In practice, the
number of matched cases used often depends on the cost of 
matched cases and the cost of computation involved in 
model fitting. In our method, the cost of computation for 
each added matched case per record is negligible. In the 
colorectal cancer study, while the matched cases were free, 
the ability to do the imputation based on a limited number of 
matched cases was crucial because confidentiality restrictions
prevented investigators from using the entire dataset in
modeling with zip codes (even in a coded form) attached. 
For illustrative purposes, we will use two matched cases per 
record in subsequent analyses. 


3. Application: The Colorectal Cancer Study 


The colorectal cancer database has a total of 50,740 
patient records, of which approximately 3.3% are un- 
geocodable. Among these, about half have P.O. box 
addresses (often in a rural area), and the rest are mistyped 
addresses or addresses from newly developed areas that are
not in address databases. In a study of factors predicting
provision of chemotherapy for colorectal cancer patients,
investigators believed that the following three census
block-group means would be useful contextual variables:


$Y_1$ = median household income,
$Y_2$ = percent with no high school diploma, and
$Y_3$ = percent below poverty level.


These variables were observed in geocodable records but 
unobserved in ungeocodable records. The task was to 
generate multiple imputations for the unobserved census 
variables using the methods in section 2. 

Each of the block-group means was reported in the 
census data for six race/ethnic groups, and the scientific 
analyses used only the set of block-group means 
corresponding to the race/ethnicity of each patient. For 
imputations used in Ayanian et al. (2003), we therefore 
fitted six separate models to impute all 18 (6 × 3) values for 
each ungeocodable patient and then selected the three 
variables pertinent to each patient; joint distributions for 
different race/ethnic groups were not important because 
each imputation only used values for a single group. An 
alternative would have been to use race as a matching 
variable, but this would have forced us to seek some 
matches at a much greater distance geographically, diluting 
the predictive value of the geographical match. 

For expository purposes, we assume henceforth that only 
the block-group mean corresponding to the race of each 
respondent is available, but not the means corresponding to 
the other five races that are available simultaneously in the 
census data. This is more typical of data that would be 
collected directly from the respondent, where the race 
variable itself (as a modeling variable) is quite predictive 
because income data for people of different races reflect 
differences in income associated with race. 


3.1 Matching and the Dataset 


The addresses of over 90% of ungeocodable records 
have zip codes. Zip code was therefore chosen as a 
matching covariate. A simple diagnostic for its usefulness 
appears in section 3.2. The numerical sequence of zip codes 
does not always correspond to neighborhood distance 
relationships. For example, Cambridge, Massachusetts has a 
02138 post office that also uses the 02238 zip code for 
mailboxes, and in nearby Boston there is a 02215 zip code 
that was carved out of the 02115 area. Instead of using the 
numerical sequence of zip codes, the distances between zip 
codes were computed based on latitudes and longitudes of 
their main post offices, under the assumption that two zip 
codes were closest to each other if their main post offices 
were closest to each other. 
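
The text does not state which distance formula was applied to the latitudes and longitudes. As an illustration only, the sketch below assumes great-circle (haversine) distance and ranks candidate zip codes by the distance between their main post offices. The function names and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_zip_codes(target, post_offices):
    """Return the other zip codes ordered by distance from `target`."""
    lat0, lon0 = post_offices[target]
    others = [z for z in post_offices if z != target]
    return sorted(
        others,
        key=lambda z: great_circle_km(lat0, lon0, *post_offices[z]),
    )


# Hypothetical labels and coordinates, not real post office locations.
post_offices = {
    "A": (42.00, -71.00),
    "B": (42.01, -71.00),
    "C": (42.50, -71.50),
}
print(nearest_zip_codes("A", post_offices))
```

When a record's own zip code has too few geocodable cases, matches would be sought in the zip codes at the head of this ordering.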


Survey Methodology, June 2005 


The colorectal cancer database has 1,696 ungeocodable 
records. The same number (n* =1,696) of geocodable 
records was randomly selected from the same database. For 
each of these 3,392 records, two matched geocodable cases 
were randomly chosen from its own zip code or (if 
necessary) neighboring zip codes. This created a dataset 
with 3,392x3=10,176 records. Note that n* was a 
convenient choice, because the data were free. In general, 
the choice of n* could affect both the total cost and the 
precision of the estimates. Both the randomly selected 
geocodable records and the matched cases were within- 
sample data and hence were retained in the analyses for 
Ayanian et al. (2003). We asked the cancer registry for 
these cases only because for confidentiality purposes we 
could not do the matching ourselves with the data (for the 
same cases) that we had in hand. 

The modeling covariates used in the imputation model 
were the eight administrative-record variables: age, sex, 
race, marital status, cancer stage, chemotherapy treatment, 
cancer type and radiotherapy treatment, and category of 
treating hospital’s American College of Surgeons accred- 
itation as of 1999 (ACOS99). These variables are observed 
for all 10,176 records included in the imputation model. 
(Some of these variables are predictors and some are 
outcomes in the scientific models of the main analyses, but 
the distinction is irrelevant for imputation.) The census 
mean values $Y_1$, $Y_2$ and $Y_3$ are observed in geocodable 
records, but not in ungeocodable records. These variables 
were treated as outcome variables of the imputation model 
in section 2.3. The data structure is represented by Table 1. 


Table 1 
Structure of Data Used in Imputation for the 
Colorectal Cancer Study 

                   Eight Modeling Covariates      Census Variables 
Data*              Age   Sex   ...   ACOS99       Y1    Y2    Y3 
Ungeocodable        √     √    ...     √           ?     ?     ? 
  First Match       √     √    ...     √           √     √     √ 
  Second Match      √     √    ...     √           √     √     √ 
Geocodable          √     √    ...     √           √     √     √ 
  First Match       √     √    ...     √           √     √     √ 
  Second Match      √     √    ...     √           √     √     √ 

* There were 1,696 records in each of the six types of data. 
√ = observed; ? = unobserved 


Before we fitted the model, the percentage outcomes $y_2$ 
and $y_3$ were transformed using the scaled-logit function: 

$$\log\left(\frac{(y - a)/(b - a)}{1 - (y - a)/(b - a)}\right),$$

with a = −0.5 and b = 100.5 so that after imputations the 
inverse transformation with rounding to the nearest integer 
would yield imputed values between 0 and 100 inclusive 
(Schafer 1999). Similarly, a log-transformation was applied 
to the income outcome $y_1$ so that the imputed incomes 
would be nonnegative. Note that the distributions of the 
transformed variables are closer to normality than they are 
on the original scale (Schafer 1997). To keep notation 
simple, we redefine $y_1$, $y_2$ and $y_3$ as their transformed 
versions. 
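
The transformation and its inverse can be sketched as follows, using the bounds a = −0.5 and b = 100.5 given above. The function names are ours, and the numerically stable form of the logistic function is an implementation choice rather than something specified in the text.

```python
import math

A, B = -0.5, 100.5  # bounds given in the text


def scaled_logit(y, a=A, b=B):
    """Map a percentage y with a < y < b to the real line."""
    p = (y - a) / (b - a)
    return math.log(p / (1.0 - p))


def inverse_scaled_logit(z, a=A, b=B):
    """Map a real value back to (a, b), then round to the nearest integer."""
    # Evaluate the logistic function without overflowing for large |z|.
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        p = e / (1.0 + e)
    return round(a + (b - a) * p)


# Every integer percentage survives a round trip.
print(all(inverse_scaled_logit(scaled_logit(y)) == y for y in range(101)))
```

Because the back-transformed value always lies between −0.5 and 100.5, rounding can only produce integers from 0 to 100, however extreme the imputed value is on the transformed scale.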


3.2 Preliminary Diagnostics 


A simple diagnostic test for the usefulness of the 
matching covariates is to compare the adjusted $R^2$ for the 
regression models predicting the three census variables with 
only the modeling covariates, the models with only the 
matching covariates, and the models with both. In this 
application, zip code was the only matching covariate. 
There were 1,133 distinct zip codes (hence 1,132 dummy 
variables) in the 8,480 fully observed records (the 
geocodable records and all first and second matches). Table 
2 shows the adjusted $R^2$ for models with only the eight 
modeling covariates, models with only zip code, and models 
with both modeling covariates and zip code. The adjusted 
$R^2$ values for models with both modeling covariates and zip code 
are higher than the corresponding ones for models with only 
one of the two covariate types. Our imputation procedure 
uses information from both matching and modeling 
covariates and thus can be expected to work better than 
procedures using only the matching or the modeling 
covariates (as shown by the simulation study in section 4). 
Although the contribution of the modeling covariates to $R^2$ 
is relatively modest, their inclusion is important for 
removing systematic biases and properly representing 
relationships that might be important in the scientific 
models. 

Table 2 
Adjusted $R^2$ for Alternative Regression Models 

                                     Only            Only Matching   Both Modeling 
                                     Modeling        Covariate       and Matching 
                                     Covariates (b)  (Zip Code)      Covariates 
Median household income (INC)         0.091           0.453           0.496 
Percent with no high school 
  diploma (EDU)                       0.115           0.452           0.503 
Percent below poverty level (POV)     0.047           0.327           0.343 
Model degrees of freedom (a)          26              1,133           1,158 
Sample sizes                          8,480           8,480           8,480 
Residual degrees of freedom           8,454           7,347           7,322 

(a) With intercept. 
(b) The modeling covariates are age, sex (2 levels), race (6 levels), 
    marital status (6 levels), cancer stage (6 levels), chemotherapy 
    treatment (2 levels), cancer type and radiotherapy treatment (3 
    levels), and category of treating hospital’s American College 
    of Surgeons accreditation as of 1999 (6 levels). 
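
The degrees of freedom in Table 2, and the adjustment applied to $R^2$, can be checked with a few lines of arithmetic. The sketch below assumes the usual definition, adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - p)$ with p counting the intercept. It also assumes that age enters as a single linear term, which reproduces the 26 model degrees of freedom reported in the table. Variable and function names are ours.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n records and p parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


n = 8480  # fully observed records

# Number of levels of each categorical modeling covariate (Table 2, note b).
levels = {
    "sex": 2,
    "race": 6,
    "marital status": 6,
    "cancer stage": 6,
    "chemotherapy treatment": 2,
    "cancer type and radiotherapy treatment": 3,
    "ACOS99": 6,
}

# Intercept + age (assumed linear) + one dummy per non-reference level.
p_modeling = 1 + 1 + sum(k - 1 for k in levels.values())
p_zip = 1 + 1132                      # intercept + zip code dummies
p_both = p_modeling + (p_zip - 1)     # the intercept is shared

for name, p in [("modeling", p_modeling), ("zip", p_zip), ("both", p_both)]:
    print(name, p, n - p)
```

With more than a thousand zip code dummies, the penalty in the adjusted $R^2$ is not negligible, so the comparison in Table 2 is fairer than one based on the unadjusted values.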


To determine whether a multivariate model was needed, 
we fitted a multivariate-outcome regression model with both 
modeling covariates and zip code. The estimated correlations 
between the residuals were: $r_{12} \approx -0.194$, 
$r_{13} \approx -0.297$, and $r_{23} \approx 0.357$, where “variable 1” is 
median household income, “variable 2” is percent with no 
high school diploma, and “variable 3” is percent below 
poverty level. These estimates were significantly different 
from zero, which therefore indicated that multivariate 
versions of the methods in section 2.3 should be used to 
generate imputations. 


3.3 Multiple Imputation Results and Comparisons 


Imputations under NpMMM were used in the study of 
factors predicting provision of chemotherapy for colorectal 
cancer patients (Ayanian et al. 2003). Their model included 
three indicator variables for ranges of contextual income, 
together with 21 other variables representing patient and 
hospital characteristics. The multiple imputation analysis 
shows that the information loss due to missing information 
is always less than 0.1%, which is much smaller than the 
fraction of ungeocodable records (3.3%). As expected, the 
largest fractions of missing information appeared for the 
income variables. The scientific results in Ayanian et al. 
(2003) would not have changed dramatically if the 
incomplete cases had been dropped. In this type of research, 
however, every case is precious and expensive, and saving 
the 3.3% with missing data was a contribution to the study. 

For comparison, variances of parameters under the 
complete-case analysis were on the average 4.0% larger 
than those under multiple imputation analysis. Such 
percentage differences are close to the fraction of 
incomplete cases deleted for this analysis. When the 
imputations generated by our method were included in the 
scientific analysis, the precision of the estimate of the 
“rural” effect was dramatically improved (using only the 
complete cases led to 41.6% increase in variance), due to 
the concentration of ungeocodable records in rural areas 
(21.6% of rural records are ungeocodable, but only 3.1% of 
nonrural records are ungeocodable). 


4. A Simulation Study 


This simulation study compares performance of our new 
method with three other commonly-used nonresponse 
adjustment methods. The population of this study was the 
1,696 fully observed triples — the 1,696 geocodable records 
and the corresponding first and second matches (one row 
from each of the last three horizontal blocks in Table 1) — or 
5,088 observations. For simplicity, we assumed that the 
triples were from distinct zip codes (clusters), hence 
$i = 1, 2, \ldots, I$ with $I$ = 1,696. Each cluster i contained three units 
($u = 1, 2, 3$), and the record of each unit consisted of $x_{iu}$ 
(the covariates) and $y_{iu}$ (the census variables). 


Statistics Canada, Catalogue No. 12-001-XPB 


4.1 Simulated Data and Response Mechanism 


Assuming that the design was cluster sampling with 
sample size 800, we drew random samples of 800 clusters. 
For each random sample, about half of the 800 clusters were 
randomly selected to have an ungeocodable record in which 
the census variables were unobserved, with the probability 
of missingness depending on an individual’s race and on the 
mean income of the cluster (zip code). We simulated 
missingness under a multinomial logit model where the 
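outcomes are: nothing unobserved ($w_{i0} = 1$), $y_{i1}$ unobserved 
($w_{i1} = 1$), $y_{i2}$ unobserved ($w_{i2} = 1$), and $y_{i3}$ 
unobserved ($w_{i3} = 1$). Specifically, for each $i = 1, 2, \ldots, I$, 
let $z_{i0} = 0$ and 

$$z_{iu} = a + b \times I(\text{unit } iu \text{ is White}) + c \times (\text{mean income in zip code } i) \qquad (7)$$

where $u = 1, 2, 3$. Then 

$$\Pr(w_{iu} = 1) = \exp(z_{iu}) \Big/ \sum_{u'=0}^{3} \exp(z_{iu'}) \quad \text{for } u = 0, 1, 2, 3. \qquad (8)$$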
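
The response mechanism in (7) and (8) can be sketched as below. This is an illustration under assumed inputs: the parameter values in the example call are hypothetical and are not those used in the simulation study, and the function names are ours.

```python
import math
import random


def missingness_probs(a, b, c, is_white, mean_income):
    """Return [Pr(w_i0 = 1), ..., Pr(w_i3 = 1)] for one cluster of three units.

    is_white is a sequence of three booleans, one per unit u = 1, 2, 3.
    """
    # z_i0 = 0 is the reference outcome "nothing unobserved" (equation 7).
    z = [0.0] + [a + b * (1.0 if w else 0.0) + c * mean_income for w in is_white]
    z_max = max(z)                       # subtract the maximum for stability
    weights = [math.exp(v - z_max) for v in z]
    total = sum(weights)
    return [w / total for w in weights]  # equation (8)


def draw_pattern(probs, rng):
    """Draw u in {0, 1, 2, 3}; u > 0 means unit u has unobserved census data."""
    return rng.choices(range(4), weights=probs, k=1)[0]


# Hypothetical parameter values chosen only to exercise the code.
probs = missingness_probs(
    a=-1.0, b=-0.5, c=-0.00001,
    is_white=[True, False, False], mean_income=40000.0,
)
rng = random.Random(0)
pattern = draw_pattern(probs, rng)
print([round(p, 3) for p in probs], pattern)
```

With negative b and c, as in this example, White units and units in higher-income zip codes are less likely to be ungeocodable, which is the direction of dependence described in the text.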


The results of this simulation study were based on 
datasets generated by the mechanism with a = −1, b = 11 
and c=0.0003, which made about 17% of the units in a 
random sample ungeocodable, with probability of 
geocoding positively related to White race and higher block- 
level income. The task was to use the random sample to 
estimate y, the mean values of the population (1,696 
clusters). 

The simulation conditions described in the preceding 
paragraphs were designed to give a stringent test of the 
procedure and alternatives by exaggerating the impact of 
unobserved data and making the missingness strongly 
related to characteristics both of the individual and of the 
area. We were not attempting to simulate the exact con- 
ditions of the application in section 3 but rather to use an 
artificial population with similar distributions to those in the 
real population to illustrate the workings of our method and 
its competitors. 


4.2 Inferential Methods and Measures of 
Performance 


Preliminary results indicated that the performance of 
PMMM and NpMMM is similar; NpMMM is, however, 
simpler (especially in analyses with multivariate outcomes), 
because the method does not require explicit parametric 
modeling of the residual variance. Our simulations compared 
performance of NpMMM (using two matched cases 
per record) with three other commonly-used nonresponse 
adjustment methods: 




1. Complete-case Method (CCM) 
The population means are estimated from all geo- 
codable units of a random sample. 

2. Substitute Single Imputation (SSI) 
This is the traditional use of substitutes. The un- 
observed census variables of each ungeocodable unit 
are replaced by the values of the census variables of a 
randomly selected unit from the same cluster. The 
resulting sample is treated as if there had been no 
ungeocodable unit; all 800 clusters in such a sample 
are used for estimating the population means. 

3. Multivariate Normal Multiple Imputation 
(MNMI) 
This method uses only one randomly selected unit 
from each of the fully observed clusters in a random 
sample to fit the multivariate normal linear regression 


$$y^{T} \sim N(\beta_0^{T} + x^{T}\beta, \Sigma),$$


with a noninformative prior on the parameters. The 
model is then used to create m sets of multiple 
imputations for the unobserved census variables using 
a direct multivariate generalization of the algorithm 
given by Rubin (1987, page 167). 


Note that CCM uses neither matching nor modeling 
covariates, SSI uses only the matching covariate (zip code), 
MNMI uses only the modeling covariates, and NpMMM 
uses both the matching covariate and the modeling 
covariates. 

The CCM and SSI data are analyzed by the usual 
complete-data method which estimates the population mean 
from the data with the appropriate estimator for cluster 
sampling from a finite population, including the finite 
population correction (Cochran 1977, Chapters 9-10). Both 
MNMI and NpMMM produce m sets of complete data, 
each of which is analyzed by the same complete-data 
method used for the CCM and SSI data; the m sets of point 
and variance estimates are then combined using the multiple 
imputation combination rule (Rubin 1987; Schafer 1997, 
pages 108-110). 
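
The combination step can be sketched as follows for a scalar estimand. This is the standard form of Rubin's rules; the fraction of missing information returned here is the simple large-sample approximation, not the small-sample expression of Barnard and Rubin (1999) used in the paper. The function name and the example numbers are hypothetical.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Combine m completed-data estimates and their variance estimates."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)            # combined point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    t = u_bar + (1.0 + 1.0 / m) * b    # total variance
    # Simple approximation to the fraction of missing information.
    lam = (1.0 + 1.0 / m) * b / t if t > 0 else 0.0
    return q_bar, t, lam


# Hypothetical estimates from m = 3 imputed datasets.
q_bar, total_var, frac_missing = combine_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(q_bar, total_var, frac_missing)
```

The between-imputation term is what distinguishes a multiple imputation analysis from a single imputation treated as complete data: SSI has no such term and therefore tends to understate uncertainty.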

For each simulation $t \in \{1, 2, \ldots, T\}$, we denote the 
point estimates from the four methods by $\hat{y}_{CC}(t)$, 
$\hat{y}_{SS}(t)$, $\hat{y}_{MN}(t)$, and $\hat{y}_{Np}(t)$, and the means of these 
quantities across simulations are written as $\bar{y}_{CC}$, 
$\bar{y}_{SS}$, $\bar{y}_{MN}$, and $\bar{y}_{Np}$. Performance evaluation of the four 
nonresponse adjustment methods will be based on three 
measures: 

1. Percent reduction in the average bias of an 
estimator relative to the average bias of the CCM 
estimator. Denote the average bias of an estimator by 
$\mathbf{b}_E$, where $E \in \{CC, SS, MN, Np\}$. We define the 
percent reduction in the average bias of an estimator 
relative to the average bias of the CCM estimator as 
$R(b_E, b_{CC})$, where $b_E$ is an element of $\mathbf{b}_E$ and $b_{CC}$ is the 
corresponding element in $\mathbf{b}_{CC}$. By definition, $R(b_{CC}, b_{CC})$ 
is zero. 


2. Estimated coverage of the nominal 95% confidence 
intervals for y. Intervals produced by the CCM or 
SSI estimates were constructed under appropriate 
t-distributions. For intervals associated with the 
MNMI or NpMMM estimates, we followed the 
procedure outlined in Schafer (1997, pages 109-110) 
and replaced the degrees of freedom v with the 
updated version of Barnard and Rubin (1999). 


3. Estimated fraction of missing information about 
y. For each of MNMI and NpMMM, we computed 
an estimate of the fraction of missing information 
about y (see Barnard and Rubin (1999) for the most 
recent expression). 
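
A minimal sketch of the first two measures follows. It assumes that the average bias is the mean of the estimates over simulations minus the population value, and that $R(b_E, b_{CC}) = 1 - |b_E| / |b_{CC}|$. Both forms are our reading of the verbal definitions rather than formulas quoted from the text, and the function names and example inputs are hypothetical.

```python
from statistics import mean


def average_bias(estimates, truth):
    """Mean of the simulated point estimates minus the population value."""
    return mean(estimates) - truth


def percent_bias_reduction(b_e, b_cc):
    """100 * R(b_E, b_CC), assuming R = 1 - |b_E| / |b_CC|."""
    return 100.0 * (1.0 - abs(b_e) / abs(b_cc))


def coverage(intervals, truth):
    """Percentage of (lower, upper) intervals that contain the truth."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return 100.0 * hits / len(intervals)


# Hypothetical inputs chosen only to exercise the functions.
print(average_bias([1.0, 3.0], 1.5))
print(percent_bias_reduction(-2.0, -8.0))
print(coverage([(0.0, 2.0), (3.0, 4.0)], 1.0))
```

Under this reading, a method whose bias equals that of CCM scores zero and a method with no bias scores 100.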


4.3 Results 


The simulation procedure was implemented 2,000 times, 
and m=10 was used for MNMI and NpMMM. The mean 
values of the census variables in the population were 
$y = (40,642, 21.65, 9.55)^T$. The average bias of the CCM 
estimator was $\mathbf{b}_{CC} = (-5,405, -3.97, -1.79)^T$. Other 
results are summarized in Table 3. NpMMM achieved large 
percent reductions in relative average bias (95.0% to 
99.5%). SSI reduced biases more than MNMI, because the 
matching covariate (zip code) was much more informative 
than the set of modeling covariates (section 3.2). Since the 
response mechanism was nonignorable (the response 
probabilities depended partly on income), the poor 
performance of MNMI, which did not use the geographical 
information to help predict income, was expected. Note that 
MNMI is biased, and the bias is large enough so that with 
the sample size considered in this paper the confidence 
intervals never covered the hypothetical population values. 

Under MNMI and NpMMM, the percent of missing 
information was much less than the average percent of 
unobserved data. The percent of missing information was 
smaller under NpMMM than under MNMI. Only NpMMM 
produced well calibrated intervals with correct coverage. In 
summary, NpMMM combines the best features of the other 
two methods: close-to-nominal coverage and less missing 
information. 
Table 3 
Simulation Results (a): Bias Reduction, Coverage, and Fraction of 
Missing Information 

                                             Method 
Measure                         Mean    NpMMM    MNMI    SSI 
Percent bias reduction          INC     99.5     44.6    95.2 
  100R(b_E, b_CC) (b)           EDU     95.0     40.6    83.7 
                                POV     96.8     32.6    80.3 
Estimated coverage of the       INC     95.1     0.00    89.8 
  95% CIs (c)                   EDU     94.8     0.00    65.7 
                                POV     95.2     0.00    66.0 
100 × Estimated fraction of     INC     1.00     9.92 
  missing information (d)       EDU     0.05     0.07 
                                POV     0.07     0.08 

(a) Based on 2,000 replications and m = 10. 
(b) By definition, 100R(b_CC, b_CC) = 0. 
(c) Results for the CCM estimates were all zeros. 
(d) The average percent of unobserved data was approximately 17%. 


5. Conclusion 


This work extends Rubin and Zanutto (2001) in two 
respects. First, our method allows more than one matched 
case per record. We show theoretically that the efficiency of 
an imputation increases as the number of matched cases per 
record increases. When the cost of matched cases is rela- 
tively low, our method offers an option where information 
of more than one matched case per record is used to help fit 
imputation models at a negligible computational expense. 
Second, NpMMM does not require explicit parametric 
modeling of residual variance(s), hence simplifying the 
modeling task (especially for analyses with multivariate 
outcomes). This nonparametric approach makes it feasible 
to apply our method to datasets with complex model 
structures. In a simulation study, NpMMM estimates 
achieved substantial bias reductions, and NpMMM 
produced confidence intervals with correct coverage. 

Although we have focused on geographically-based 
matching to complete unobserved geographically-linked 
variables, the procedures described in this paper can be 
generalized to other matching variables. For example, to 
impute clinical variables, it might be more appropriate to 
match to another patient in the same hospital, if clinical 
characteristics and therapies are likely to be more strongly 
associated with the hospital than with the geographic 
location of the patient’s residence. 
Hierarchical Bayesian Nonignorable Nonresponse Regression 
Models for Small Areas: An Application to the NHANES Data 


Balgobin Nandram and Jai Won Choi ¹ 


Abstract 


We use hierarchical Bayesian models to analyze body mass index (BMI) data of children and adolescents with nonignorable 
nonresponse from the Third National Health and Nutrition Examination Survey (NHANES III). Our objective is to predict 
the finite population mean BMI and the proportion of respondents for domains formed by age, race and sex (covariates in 
the regression models) in each of thirty five large counties, accounting for the nonrespondents. Markov chain Monte Carlo 
methods are used to fit the models (two selection and two pattern mixture) to the NHANES III BMI data. Using a deviance 
measure and a cross-validation study, we show that the nonignorable selection model is the best among the four models. We 
also show that inference about BMI is not too sensitive to the model choice. An improvement is obtained by including a 
spline regression into the selection model to reflect changes in the relationship between BMI and age. 


Key Words: Cross-validation; Deviance; Metropolis-Hastings sampler; Normal-logistic regression model; Spline 
regression model. 


1. Introduction 


The National Health and Nutrition Examination Survey 
(NHANES III) is one of the surveys used by the National 
Center for Health Statistics (NCHS) to assess the health of 
the U.S. population. One of the variables in this survey is 
body mass index (BMI), and the World Health Organization 
has used BMI to define overweight and obesity. Under 
ignorability, estimators from the NHANES III data are 
biased because there are many nonrespondents, and the 
main issue we address here is that nonresponse should not 
be ignored because respondents and nonrespondents may 
differ. The purpose of this work is to predict the finite 
population mean BMI for children and adolescents, post- 
stratified by county for each domain formed by age, race 
and sex and to investigate what adjustment needs to be 
made for nonignorable nonresponse. Our approach is to fit 
several hierarchical Bayesian models to accommodate the 
nonresponse mechanism. 

Recently, several articles have been written about over- 
weight and obesity. In outlining the first national plan of 
action for overweight and obesity, the Surgeon General 
called for sweeping changes in schools, restaurants, 
workplaces and communities to help combat the growing 
epidemic of Americans who are overweight or obese. He 
said that the obesity report “Is not about esthetics and it’s 
not about appearances. We’re talking about health.” As 
noted by Squires (2001) “Health care costs for overweight 
and obesity total an estimated $117 billion annually.” 
Overweight children often become overweight in adulthood, 


and overweight in adulthood is a health risk (Wright, Parker, 
Lamont and Craft 2001). In a very interesting article, using 
NHANES data Ogden, Flegal, Carroll and Johnson (2002) 
describe the most recent national estimates of the prevalence 
and trends in overweight among U.S. children and ado- 
lescents. Based on a limited analysis they conclude “The 
prevalence of overweight among children in the United 
States is continuing to increase especially among Mexican- 
American and non-Hispanic black adolescents.” Several 
disorders have been linked to overweight in childhood. A 
potential increase in type 2 diabetes mellitus is related to the 
increase in overweight among children (Fagot-Campagna 
2000); so are cardiovascular risk factors, high cholesterol 
levels, and abnormal glucose levels (Dietz 1998). Thus, it 
would be helpful to study the BMIs for children and 
adolescents using methods that can provide accurate 
adjustment for nonresponse and better measure of precision. 

Letting x denote covariates and y the response variable, 
Rubin (1987) and Little and Rubin (1987) describe three 
types of missing-data mechanism. These types differ 
according to whether the probability of response (a) is 
independent of x and y, (b) depends on x but not on y, or (c) 
depends on y and possibly x. The missing data are 
missing completely at random (MCAR) in (a), missing at 
random (MAR) in (b) and one may say that the data are 
missing not at random (MNAR) in (c). Models for MCAR 
and MAR missing-data mechanisms are called ignorable if 
the parameters of the dependent variable and the response 
are distinct (Rubin 1976). Models for MNAR missing-data 
mechanisms are called nonignorable. 


1. Balgobin Nandram, Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280. E-mail: 
balnan@wpi.edu; Jai Won Choi, National Center for Health Statistics, 3311 Toledo Road, Hyattsville, MD 20782. E-mail: jwce7@cdc.gov. 
Nonresponse models can be classified very broadly into 
selection and pattern mixture models (e.g., see Little and 
Rubin 1987). Let [y] and [r] denote respectively the 
density function of the response variable y, and the response 
indicator r, with obvious notations for the joint and 
conditional densities. Then the selection model specifies 
that [y,r]=[r|y][y] and the pattern mixture model 
specifies [y,r]=[y|r][r]. The selection approach was 
developed to study sample selection problems (e.g., 
Heckman 1976 and Olson 1980). While the two models 
have the same joint density, in practice the components 
[r| y] and [ y] for the selection model, and [y|r] and [r] 
for the pattern mixture model are specified. Thus, these 
models may differ. 
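
The equivalence of the two factorizations, and the way they reparameterize the same joint distribution, can be seen in a small discrete example. The sketch below uses hypothetical probabilities for a two-category response variable; none of the numbers come from the NHANES III data.

```python
# Selection specification: [y] and [r | y], with r = 1 for respondents.
p_y = {"low": 0.7, "high": 0.3}
p_r_given_y = {
    "low": {1: 0.8, 0: 0.2},
    "high": {1: 0.5, 0: 0.5},   # response depends on y: nonignorable
}

# Joint distribution implied by the selection model: [y, r] = [r | y][y].
joint_sel = {(y, r): p_y[y] * p_r_given_y[y][r] for y in p_y for r in (0, 1)}

# Pattern mixture components derived from the same joint: [r] and [y | r].
p_r = {r: sum(joint_sel[(y, r)] for y in p_y) for r in (0, 1)}
p_y_given_r = {r: {y: joint_sel[(y, r)] / p_r[r] for y in p_y} for r in (0, 1)}

# Recomposing [y | r][r] gives back the same joint distribution.
joint_pm = {(y, r): p_y_given_r[r][y] * p_r[r] for y in p_y for r in (0, 1)}

for key in sorted(joint_sel):
    print(key, round(joint_sel[key], 4), round(joint_pm[key], 4))
print(round(p_y_given_r[0]["high"], 3), p_y["high"])
```

In this example the nonrespondents contain a larger share of "high" values than the population as a whole. In practice the two approaches differ because the analyst specifies different components directly, and [y | r = 0] is never observed.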

We therefore use two nonignorable nonresponse models, a 
selection model and a pattern mixture model, to analyze the 
NHANES III data. Each model is used in the hierarchical 
Bayesian framework for our nonignorable nonresponse 
problem, and to study sensitivity to model choice the results 
are compared. In the selection model, the response 
propensity is related to BMI only, and the model on 
BMI is linear in age, race, sex and the interaction 
of race and sex. In the pattern mixture model, the propensity 
to respond is related to age, race and sex (not BMI), and the 
model on BMI has two closely related linear forms on age, 
race, sex and the interaction of race and sex. These two 
models hold for the entire population. The BMI values of 
the nonrespondents and the nonsampled individuals are 
predicted from each model. We prefer the selection model 
because we can incorporate the structure in the NHANES 
III data, and statistical arguments turn out to support 
this preference. 

Greenlees, Reece and Zieschang (1982) developed a 
normal-logistic regression model for imputing missing 
values when the probability of response depends upon the 
variable being imputed. They applied the model to wage and 
salary data from the Current Population Survey (CPS). 
David, Little, Samuel and Triest (1986) 
applied the CPS hot deck method and the normal-logistic 
regression model to wages and salary from a similar data 
set, and they found very little difference between the two 
methods. We note that the normal-logistic regression model 
is a nonignorable nonresponse selection model, but it does 
not account for clustering. To accommodate clustering 
within counties in the NHANES III data, it is natural to start 
with the normal-logistic model. 

Our hierarchical Bayesian selection model has a special 
structure. In NHANES III the propensity to respond 
increases with age (race and sex play a minor role), and 
doctors believe that obese individuals tend not to turn up for 
the physical examination. Thus, as in Greenlees et al. (1982), 
given the BMI values the response indicators follow a 
logistic regression model with the logarithm of the BMI 
values being the covariate. In turn, the logarithms of the 
BMI values are distributed according to a linear model in 
which the covariates are age, race and sex. This is the most 
important information we incorporate into the selection 
model. In addition, unlike Greenlees et al. (1982) our model 
includes clustering effects to account for heterogeneity 
among counties through the response indicators and the 
BMI values. Here each county has its own set of parameters, 
and there is a common distribution over these sets of 
parameters. This is also important prior information we 
incorporate into our model, and it is one of the attractive 
features of the hierarchical Bayesian methodology. 
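
The normal-logistic structure described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: county effects and the age, race and sex covariates are omitted, a single mean is used for log BMI, and all parameter values are hypothetical. The function name is ours.

```python
import math
import random
from statistics import mean


def simulate_unit(rng, mean_log_bmi, sigma, alpha0, alpha1):
    """Draw one (BMI, responded) pair from a normal-logistic selection model."""
    log_bmi = rng.gauss(mean_log_bmi, sigma)        # log BMI is normal
    logit = alpha0 + alpha1 * log_bmi               # response depends on log BMI
    p_respond = 1.0 / (1.0 + math.exp(-logit))
    return math.exp(log_bmi), rng.random() < p_respond


rng = random.Random(2005)
# Hypothetical values; alpha1 < 0 makes high-BMI individuals respond less.
units = [simulate_unit(rng, 3.0, 0.2, 10.0, -3.0) for _ in range(20000)]

mean_all = mean(bmi for bmi, _ in units)
mean_resp = mean(bmi for bmi, responded in units if responded)
response_rate = sum(1 for _, responded in units if responded) / len(units)
print(round(mean_all, 2), round(mean_resp, 2), round(response_rate, 2))
```

When the slope on log BMI is negative, the respondent mean understates the overall mean, which is the kind of bias an ignorable analysis of these data would leave uncorrected.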

In the Bayesian approach, the main difficulty is formu- 
lating the relationship between the respondents and non- 
respondents. This latter issue can be accommodated within 
the selection approach through the normal-logistic structure. 
We also consider a hierarchical Bayes model within the 
pattern mixture approach. The pattern mixture model is a 
useful alternative to study sensitivity to the assumption in 
the selection model. To assess the assumption of non- 
ignorable nonresponse, we also consider special cases of the 
selection and pattern mixture models to obtain two 
ignorable models. We found that a fifth model is required, 
in which we extend our selection model to a spline 
regression model to accommodate the dynamic relation 
between BMI and age. 

Nandram, Han and Choi (2002) developed a methodology 
to analyze the BMI data by age, race and sex when 
BMI is categorized into three intervals. This is a multinomial 
extension of the nonignorable nonresponse analysis 
of Stasny (1991) for binary data. This methodology applies 
generally to any number of cells in several areas (counties in 
our application). Nandram and Choi (2002 a,b) consider 
further extensions of the work of Stasny for binary data (i.e., 
data from the National Health Interview Survey and the 
National Crime Survey). Here we do not categorize the BMI 
values, but rather we treat them in their own right as 
continuous values. The quantities of interest are the finite 
population mean BMI and the proportion of responding 
individuals in each domain formed by age, race, sex and 
county. 

The rest of the paper is organized as follows. In section 2, 
we briefly describe the NHANES III data. In section 3, we 
discuss the hierarchical Bayesian models for ignorable and 
nonignorable nonresponse. We also describe the model 
fitting, model selection and assessment which use predictive 
deviance and cross-validation. In section 4 we describe the 
analysis of NHANES III BMI data. Section 5 has a 
description of a spline regression model and comparisons. 
Finally, section 6 has concluding remarks about our 
approach. 




2. NHANES III Data 


The sample design is a stratified multistage probability 
design which is representative of the total civilian non- 
institutionalized population, 2 months of age or older, in the 
United States. The number of sampled individuals in each 
age-race-sex group is known for each county. The sample 
by county, age, race and sex is relatively sparse. 
Further details of the NHANES III sample design are 
available (National Center for Health Statistics 1992, 1994). 

The NHANES III data collection consists of two parts: 
the first part is the sample selection and the interview of the 
members of a sampled household for their personal infor- 
mation, and the second part is the examination of those 
interviewed at the mobile examination center (MEC). The 
health examination has information on physical examina- 
tion, tests and measurements performed by technicians, and 
specimen collection. 

The sample was selected from households in 81 counties 
across the continental United States during the period from 
October 1988 through September 1994, but for confi- 
dentiality reasons the final data of this study came from only 
the 35 largest counties (from 14 states) with population at 
least 500,000 for selected age categories by sex and race. In 
this paper, we analyze public use data from these 35 
counties; the demographic variables are age, race and sex, 
and the health indicator of our interest is body mass index 
(BMI), weight in kilograms divided by the square of height 
in meters (Kuczmarski, Carroll, Flegal and Troiano 1997). 
The World Health Organization (WHO Consultation of 
Obesity 2000) has designated an adult with BMI at least 30 
as obese; overweight refers to adults with BMI in the range 
[25, 30). For children 1-6 years old and adolescents 7-19 
years old, overweight and obesity are age-dependent. 

Nonresponse occurs in the interview and examination 
parts of the survey. The interview nonresponse arises from 
sampled persons who did not respond for the interview. 
Some of those who were already interviewed and included 
in the subsample for a health examination missed the 
examination at home or at the MEC, thereby missing all or 
part of the examinations. Here we do not consider the small 
number of individuals whose BMI values and covariates 
(age, race and sex) are missing (i.e., unit nonresponse). For 
simplicity and for all practical purposes it is reasonable to 
include all individuals with their covariates (i.e., complete 
data and item nonresponse) reported in our data analysis. 
Cohen and Duffy (2002) point out that “Health surveys are 
a good example, where it seems plausible that propensity to 
respond may be related to health.” We note also that for 
children and adolescents the observed nonresponse rate is 
about 24%. A partial reason for the nonresponse for young 
children is that the parents or older mothers were extremely 
protective and would not allow their children to leave home
for a physical examination. 

We study the BMI data for four age classes (2-4, 5-9,
10-14 and 15-19 years). Because there are 560
(35 × 4 × 2 × 2) domains, the sample sizes are on average
very small per domain (e.g., 2,647/560 ≈ 4.7). Thus, there
is a need to “borrow strength” across the domains. Also, the
sample size is small relative to the finite population size
(e.g., 100 × (2,647/6,653,738) ≈ 0.04%). The prediction
problem needs much computation. The observed data
indicate that BMI increases with age, with slightly
increasing variability.
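
As a quick check of the arithmetic quoted above, the following sketch recomputes the number of domains, the average sample size per domain and the sampling fraction. The variable names are illustrative and are not taken from the paper.

```python
# Counts quoted in the text; the variable names are illustrative only.
counties, age_classes, races, sexes = 35, 4, 2, 2
sample_size = 2_647
population_size = 6_653_738

domains = counties * age_classes * races * sexes              # number of small domains
avg_per_domain = sample_size / domains                        # average sample size per domain
sampling_fraction_pct = 100 * sample_size / population_size   # sample as % of population

print(domains, round(avg_per_domain, 1), round(sampling_fraction_pct, 2))
```

With fewer than five sampled individuals per domain on average, direct domain estimates would be very unstable, which is the motivation for pooling information across counties.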

NHANES III data are adjusted by multiple stages of ratio 
weightings to be consistent with the population; see 
Mohadjer, Bell and Waksberg (1994). In this ratio method,
item nonresponse adjustment is done by ratio estimation
within the same adjustment class and the distributions of the
respondents and nonrespondents are assumed to be the same.
There is a need to consider methods for handling non- 
ignorable nonresponse other than the ratio-adjustment 
method. Here we present a Bayesian method as a possible 
alternative for studying NHANES III nonresponse. 

Schafer, Ezzati-Rice, Johnson, Khare, Little and Rubin 
(1996) attempted a comprehensive multiple imputation 
project on the NHANES III data for many variables. The 
purpose was to impute the nonresponse data in order to 
provide several data sets for public use. As one of the 
limitations of the project they stated “the procedure used to 
create missingness corresponds to a purely ignorable 
mechanism; the simulation provides no information on the 
impact of possible deviations from ignorable nonresponse.” 
Another limitation is that the procedure did not include 
geographical clustering. Our purpose is different; we do not 
provide imputed public-use data. Unlike Schafer et al. 
(1996), we include clustering at the county level, although 
there may be a need to include clustering at the household 
level. For the complete data there are 6,440 households. Of 
these households 52.1% contributed one person to the 
sample, 22.5% two persons, and 21.4% at least three 
persons. We have calculated the correlation coefficient for 
the BMI values based on pairing the members within 
households (see Rao 1973, page 199). It is 0.19 which 
indicates that as a first approximation the clustering within 
households can be ignored. 
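
A minimal sketch of a within-household correlation computed by pairing is given below. It assumes that every ordered pair of members of the same household is entered as one point and that the ordinary product-moment correlation is then taken over all such pairs; whether this matches the exact formula in Rao (1973) is an assumption, and the function name is hypothetical.

```python
from itertools import permutations

def pairwise_intraclass_correlation(households):
    """Correlation over all ordered within-household pairs.

    `households` is a list of lists of BMI values, one list per household.
    Households with a single member contribute no pairs.
    """
    pairs = [
        (a, b)
        for members in households
        if len(members) >= 2
        for a, b in permutations(members, 2)
    ]
    n = len(pairs)
    # Pairs are symmetric, so both coordinates share one mean and one variance.
    mean = sum(a for a, _ in pairs) / n
    var = sum((a - mean) ** 2 for a, _ in pairs) / n
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / n
    return cov / var
```

A value near zero suggests that members of the same household are not much more alike than members of different households; values near one suggest strong household clustering.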

For our current application, inference is required for each 
age, race and sex domain within county. One standard small 
area estimation method is to identify each small area by a 
parameter, and then assume a common stochastic process 
over the 560 parameters. But because of the sparseness of 
the data, this is not desirable. Thus, our models are 
constructed at county level, and at the same time age, race 
and sex are represented as covariates. Inference is made for 
each domain formed by crossing age, race and sex within
county through our regression models. This is a key point in 
our analysis. 


3. Hierarchical Bayesian Methodology 


In this section we describe two Bayesian models for non- 
ignorable nonresponse, and we deduce two additional 
ignorable models as special cases. We describe the model 
selection and assessment for the selected model (i.e., the 
selection model). 

There are data from $\ell = 35$ counties and each county has
$N_i$ (known) individuals. We assume a probability sample
of $n_i$ individuals is taken from the $i$th county. Let $s$ denote
the set of sampled units and $ns$ the set of nonsampled units.
Let $r_{ij}$, for $i = 1, 2, \ldots, \ell$ and $j = 1, 2, \ldots, N_i$, be the response
indicator ($r_{ij} = 1$ for respondents and $r_{ij} = 0$ for
nonrespondents) for the $j$th individual within the $i$th county in the
population. Also, let $x_{ij}$ be the logarithm of the BMI value.
We found that the logarithm transformation gives a better
representation, and we use it throughout. Note that $r_{ij}$ and
$x_{ij}$ are all observed in the sample $s$ but they are unknown in
$ns$. Let $r_i = \sum_{j=1}^{n_i} r_{ij}$ (i.e., $r_i$ is the number of sampled
individuals that responded in the $i$th county).

For convenience, we express the BMI $x_{ij}$ as $x_{i1}, x_{i2}, \ldots, x_{i r_i}, x_{i, r_i+1}, \ldots, x_{i n_i}$
in $s$ and $x_{i, n_i+1}, \ldots, x_{i N_i}$ in $ns$ for county $i$. A
key point that we note for what follows is that the $r_i$ individuals
are not necessarily random respondents from the $n_i$
individuals randomly sampled. This is the nonresponse bias
we need to address. It is clear that we need to predict the
BMI value $x_{ij}$ for (a) the nonrespondents in $s$ and (b) the
individuals in $ns$. Thus, for the finite population of $N_i$
individuals, we need a Bayesian predictive inference for

$$\bar{X}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij} \quad \text{and} \quad P_i = \frac{1}{N_i}\sum_{j=1}^{N_i} r_{ij}, \qquad i = 1, \ldots, \ell.$$

Letting $\bar{x}_i^{(s,r)} = \sum_{j=1}^{r_i} x_{ij}/r_i$, $\bar{x}_i^{(s,nr)} = \sum_{j=r_i+1}^{n_i} x_{ij}/(n_i - r_i)$
and $\bar{x}_i^{(ns)} = \sum_{j=n_i+1}^{N_i} x_{ij}/(N_i - n_i)$, we note that

$$\bar{X}_i = f_i\{g_i \bar{x}_i^{(s,r)} + (1 - g_i)\bar{x}_i^{(s,nr)}\} + (1 - f_i)\bar{x}_i^{(ns)} \qquad (1)$$

where $f_i = n_i/N_i$ and $g_i = r_i/n_i$. Note that while the $f_i$
are fixed by design, the $g_i$ and $\bar{x}_i^{(s,r)}$ are observed. Also,
letting $p_i^{(ns)} = \sum_{j=n_i+1}^{N_i} r_{ij}/(N_i - n_i)$ and noting that $r_i/N_i = f_i g_i$,

$$P_i = f_i g_i + (1 - f_i)\, p_i^{(ns)}, \qquad (2)$$

$i = 1, \ldots, \ell$. We develop our hierarchical Bayesian models
to perform predictive inference for quantities like (1) and (2)
depending on the domain.
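
The decomposition in (1) can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, each of the three groups is assumed to be non-empty, and in practice the nonrespondent and nonsampled values are not observed and would be replaced by predictions.

```python
def finite_population_mean(respondents, nonrespondents, nonsampled):
    """Combine the three group means as in equation (1).

    Each argument is a non-empty list of values for one county:
    sampled respondents, sampled nonrespondents, nonsampled individuals.
    """
    r = len(respondents)
    n = r + len(nonrespondents)
    N = n + len(nonsampled)
    f = n / N   # sampling fraction f_i
    g = r / n   # response rate g_i among the sampled individuals

    def mean(values):
        return sum(values) / len(values)

    sampled_mean = g * mean(respondents) + (1 - g) * mean(nonrespondents)
    return f * sampled_mean + (1 - f) * mean(nonsampled)
```

Because $f_i$ is tiny in this application, almost all of the weight in (1) falls on the nonsampled mean, so the quality of the prediction for nonsampled individuals dominates the inference.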
3.1 Competing Models


Our models have two parts, one part for the response 
mechanism and the other part for the distribution of BMI. 
These two parts are connected to form a single model under 
nonignorable nonresponse or ignorable nonresponse. 

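First, we describe the selection model. For Part 1 of this
model the response depends on the BMI as follows:

$$r_{ij} \mid x_{ij}, \beta_i \sim \text{Bernoulli}\left\{\frac{e^{\beta_{0i} + \beta_{1i} x_{ij}}}{1 + e^{\beta_{0i} + \beta_{1i} x_{ij}}}\right\}, \qquad (3)$$

$$(\beta_{0i}, \beta_{1i}) \mid \theta_0, \theta_1, \sigma_1^2, \sigma_2^2, \rho_1 \stackrel{iid}{\sim} \text{BVNormal}(\theta_0, \theta_1; \sigma_1^2, \sigma_2^2, \rho_1); \qquad (4)$$

$$\theta \sim N(\theta^{(0)}, \Delta^{(0)}), \quad \sigma_1^{-2}, \sigma_2^{-2} \stackrel{iid}{\sim} \text{Gamma}(a/2, a/2) \quad \text{and} \quad \rho_1 \sim \text{Uniform}(-1, 1), \qquad (5)$$

where $\theta = (\theta_0, \theta_1)'$, and $a$, $\theta^{(0)}$ and $\Delta^{(0)}$ are to be specified. Note that the
prior densities in (5) are all jointly independent. The
assumption (3) is important because it relates the response
propensity to the BMI values; doctors believe that overweight
and obese individuals tend not to come to the MECs
for the examinations. Clustering among the counties is
accommodated by (4), and it is this assumption that permits
a “borrowing of strength” among the counties.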
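
To make the response mechanism in (3) concrete, the following sketch evaluates the response probability for a given log BMI value. The function name and the numerical values in the usage note are illustrative assumptions, not estimates from the paper.

```python
import math

def response_probability(x, beta0, beta1):
    """P(r_ij = 1 | x_ij) under the logistic response model (3).

    x is the logarithm of BMI; beta0 and beta1 are the county-level
    intercept and slope. Setting beta1 = 0 removes the dependence on
    BMI, which corresponds to an ignorable response mechanism.
    """
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))
```

For example, `response_probability(3.0, -4.0, 1.5)` and `response_probability(3.4, -4.0, 1.5)` differ, so two individuals in the same county with different BMI values have different response propensities; with `beta1 = 0` they would coincide.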

The second part of the model is about the BMI. The
single most important predictor of BMI is age, with race
and sex playing a relatively minor role. One possibility is to
take the BMI values to be

$$x_{ij} = \alpha_{0ij} + \alpha_{1ij}\, a_{ij} + e_{ij},$$

where $a_{ij}$ denotes age and $e_{ij} \mid \sigma_3^2 \stackrel{iid}{\sim} \text{Normal}(0, \sigma_3^2)$ for
$i = 1, \ldots, \ell$ and $j = 1, \ldots, N_i$. There is a need to understand
the relationship between BMI and age, race and sex.
We let $z_{ij0} = 1$ for an intercept, $z_{ij1} = 1$ for non-black and
$z_{ij1} = 0$ for black, $z_{ij2} = 1$ for male and $z_{ij2} = 0$ for female,
$z_{ij3} = z_{ij1} z_{ij2}$ for the interaction between race and sex, and
we let $z_{ij} = (z_{ij0}, z_{ij1}, z_{ij2}, z_{ij3})'$. Then, for a regression of
BMI on age adjusting for race and sex, letting
$\alpha_1 = (\alpha_{01}, \alpha_{02}, \alpha_{03}, \alpha_{04})'$ and $\alpha_2 = (\alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{14})'$,
we take $\alpha_{0ij} = z_{ij}'\alpha_1 + \nu_{0i}$ and $\alpha_{1ij} = z_{ij}'\alpha_2 + \nu_{1i}$ to get

$$\mu_{ij} = (z_{ij}'\alpha_1 + \nu_{0i}) + (z_{ij}'\alpha_2 + \nu_{1i})\, a_{ij},$$

where $\nu_{0i}$ and $\nu_{1i}$ are random effects centered at zero with the
bivariate normal distribution shown below for each model.
Thus, in Part 2 of the selection model, we assume

$$x_{ij} = (z_{ij}'\alpha_1 + \nu_{0i}) + (z_{ij}'\alpha_2 + \nu_{1i})\, a_{ij} + e_{ij} \quad \text{and} \quad e_{ij} \mid \sigma_3^2 \stackrel{iid}{\sim} \text{Normal}(0, \sigma_3^2), \qquad (6)$$

$$(\nu_{0i}, \nu_{1i}) \mid \sigma_4^2, \sigma_5^2, \rho_2 \stackrel{iid}{\sim} \text{BVNormal}(0, 0; \sigma_4^2, \sigma_5^2, \rho_2). \qquad (7)$$
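
The mean structure in (6) can be written out as a short sketch. The function names are hypothetical, and the coefficient values used in the usage note are made up for illustration.

```python
def design_vector(non_black, male):
    """z_ij = (1, race, sex, race x sex) with 0/1 indicators as in the text."""
    return [1, non_black, male, non_black * male]

def mean_log_bmi(z, alpha1, alpha2, nu0, nu1, age):
    """Mean of log BMI: (z'alpha1 + nu0) + (z'alpha2 + nu1) * age.

    alpha1 and alpha2 are length-4 coefficient lists shared by all
    counties; nu0 and nu1 are the random effects of one county.
    """
    intercept = sum(zk * ak for zk, ak in zip(z, alpha1)) + nu0
    slope = sum(zk * ak for zk, ak in zip(z, alpha2)) + nu1
    return intercept + slope * age
```

For instance, `mean_log_bmi(design_vector(0, 0), [2.7, 0, 0, 0], [0.02, 0, 0, 0], 0.1, 0.0, 10)` gives the mean log BMI of a ten-year-old black female in a county whose intercept is shifted by 0.1. Only the two county effects vary by county, which is how the 560 domains share the same eight regression coefficients.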


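Again, clustering among the counties is accommodated
by (7), and it is this assumption that permits a “borrowing of
strength” among the counties. For this part of the model, we
use the prior

$$\alpha_1 \sim \text{Normal}(\alpha_1^{(0)}, \Delta_1^{(0)}) \quad \text{and} \quad \alpha_2 \sim \text{Normal}(\alpha_2^{(0)}, \Delta_2^{(0)}),$$

$$\sigma_3^{-2}, \sigma_4^{-2}, \sigma_5^{-2} \stackrel{iid}{\sim} \text{Gamma}(a/2, a/2) \quad \text{and} \quad \rho_2 \sim \text{Uniform}(-1, 1), \qquad (8)$$

where $a$, $\alpha_k^{(0)}$ and $\Delta_k^{(0)}$, $k = 1, 2$, are to be specified. Note
that the prior densities in (8) are all jointly independent.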

The nonignorable nonresponse pattern mixture model is
presented in Appendix A. We have included race, sex and
their interaction in the response part of the model, although
these turn out to be unnecessary. The difference between the
respondents and the nonrespondents in the pattern mixture
model is that the intercepts in the regression vary with
counties for the respondents but not for the nonrespondents;
other parameters are the same. In this way we are able to
“center” the nonignorable nonresponse model on the
ignorable nonresponse model with some variation; see
Nandram and Choi (2002 a) for a similar idea. We need to
do so because the parameters become unidentifiable if a
substantial difference between the respondents and the
nonrespondents is assumed in the nonignorable nonresponse
model without scientific knowledge to support it. While we have
used random effects to discriminate between the
respondents and the nonrespondents, the parameters
providing a systematic difference between the respondents
and nonrespondents in the model of Rubin (1977) are not
identifiable. Note that while in the pattern mixture model in
(A.4) there are two specifications/patterns for $x_{ij}$ (i.e.,
$r_{ij} = 0$ and $r_{ij} = 1$), in the selection model there is a
single specification.

We show how to specify parameters like $\theta^{(0)}$, $\Delta^{(0)}$,
$\alpha_k^{(0)}$, $\Delta_k^{(0)}$, $k = 1, 2$, in Appendix C. For a proper diffuse prior
we choose $a$ to be a value like 0.002. One can also use a
shrinkage prior on the variance components (see Natarajan and Kass
2000; and Daniels 1999). But this is not necessary in the
hierarchical model.

It is an attractive property of the hierarchical Bayesian
model that it introduces correlation among the variables. For
example, in the selection model, (4) and (7) introduce a
correlation among the $r_{ij}$ and the $x_{ij}$, respectively. This is
the clustering effect within the areas. Such an effect can be
obtained directly, but it will not be as simple as in a
hierarchical model. A further benefit of the hierarchical
model is that it takes care of extraneous variations among
the areas; this is intimately connected to the cluster effect.
Yet another benefit is that there is robustness in the model
specifications at deeper levels beyond the sampling process
(e.g., inference with (5) and (8) is fairly robust to moderate
perturbations of the specifications of the hyperparameters).
We have found this empirically here and elsewhere.

We obtain an ignorable nonresponse selection model by
setting $\beta_{1i} = 0$ for all counties with appropriate adjustment
in the selection model. For an ignorable nonresponse pattern
mixture model we set $x_{ij} = (z_{ij}'\alpha_1 + \nu_{0i}) + (z_{ij}'\alpha_2 + \nu_{1i})\, a_{ij} + e_{ij}$
for both values of the $r_{ij}$.


3.2 Model Fitting 


In this section we describe how to use the Metropolis- 
Hastings sampler to fit the models. We also use a deviance 
measure to select the best model among our four models. 
Then, we use a cross-validation analysis to assess the
goodness of fit of the selected model. Because the same
general principle applies to the four models, we describe
model fitting for the selection model only.

Thus, we now combine the model for the response
mechanism and the model for the BMI values to obtain the
joint posterior density of all the parameters. The $x_{ij}$ for
$j = r_i + 1, \ldots, n_i$, $i = 1, \ldots, \ell$, are unknown; that is, they are
latent variables. We denote these latent variables by $x^{(s,nr)}$
and the observed data are denoted by $x^{(s,r)}$. Using Bayes’
theorem to combine the likelihood function and joint prior
distribution, we obtain the joint posterior density which,
apart from the normalization constant, is
$p(x^{(s,nr)}, \sigma^2, \alpha, \beta, \nu, \theta, \rho_1, \rho_2 \mid x^{(s,r)})$ and is given in
(B.1) in Appendix B.

The posterior density in (B.1) is complex, so we used 
Markov chain Monte Carlo (MCMC) methods to draw 
samples from it. Specifically, we used the Metropolis- 
Hastings sampler (see Chib and Greenberg 1995 for a 
pedagogical discussion). We also used the trace plots and 
autocorrelation diagnostics reviewed by Cowles and Carlin 
(1996) to study convergence and we used the suggestion of 
Gelman, Roberts and Gilks (1996) to monitor the jumping 
probability in each Metropolis step in our algorithm. In 
performing the computation, centering the BMI values helps
in achieving convergence (see Gelfand, Sahu and Carlin 
1995). However, this is not quite a straightforward task 
because centering in the logistic regression affects the BMI 
part of the model as well. 

We obtained a sample of 1,000 iterates which we used 
for inference and model checking. By using the trace plots 
we “burn in” 1,000 iterates, and to nullify the effect of 
autocorrelations, we picked every tenth iterate thereafter. 
This rule was obtained by trial and error while tuning the 
Metropolis steps. We maintain the jumping probabilities in 
(0.25, 0.50); see Gelman et al. (1996). 
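
The burn-in and thinning rule described above can be sketched as follows. The total chain length of 11,000 iterations used in the usage note is an inference from the numbers in the text (1,000 burn-in iterates, every tenth iterate kept, 1,000 retained), not a figure stated by the authors, and the function name is hypothetical.

```python
def burn_in_and_thin(chain, burn_in=1000, thin=10):
    """Drop the first `burn_in` iterates and keep every `thin`-th one after that."""
    return chain[burn_in::thin]
```

For a chain of 11,000 stored iterates, `burn_in_and_thin(chain)` returns 1,000 iterates, the first being the one with index 1,000. Thinning reduces the autocorrelation between retained draws at the cost of running a longer chain.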
3.3 Model Selection and Model Assessment


We used the minimum posterior predictive loss approach 
(Gelfand and Ghosh 1998) to select the best model among 
the first four. 

Under squared error loss the minimum posterior
predictive loss is

$$D_k = G + \frac{k}{k+1}\, P,$$

$$P = \sum_{i,j} \text{Var}(x_{ij}^{pre} \mid x^{(s,r)}), \qquad G = \sum_{i,j} \{E(x_{ij}^{pre} \mid x^{(s,r)}) - x_{ij}\}^2,$$

where $f(x_{ij}^{pre} \mid x^{(s,r)}) = \int f(x_{ij}^{pre} \mid \Omega)\, \pi(\Omega \mid x^{(s,r)})\, d\Omega$, the
$x_{ij}^{pre}$ are the predicted values and $\Omega$ is the set of all
parameters. This measure extends the one obtained earlier
(Laud and Ibrahim 1995), and we have taken $k = 100$ to
match this earlier version. Note that for the nonresponse
application, these measures are computed only on the
complete BMI data after fitting our nonresponse models.

In Table 1 we present the deviance measure ($D_{100}$) and
its associated components, goodness of fit (G) and the
penalty (P), for the four models. Using the deviance measure
the selection model is much better. While P is roughly the
same, G is much smaller, making $D_{100}$ smaller for the
selection model. The difference between the two pattern
mixture models is more pronounced than the difference
between the two selection models. However, because
standard errors are not available, it is difficult to tell the
strength of the difference.


Table 1
Comparison of the Ignorable, Pattern Mixture and the
Selection Models Using the Deviance Measure

Model      G      P      D_100
SEI       135    135     270
SE        118    135     253
PMI       268    135     403
PM        204    135     339

Note: $D_{100} = G + (100/(100 + 1))\, P$ where G is a goodness
of fit, P a penalty and D the deviance; the pattern
mixture (PM) model and the selection model (SE) are
both nonignorable. SEI is the ignorable version of the
selection model, PMI is the ignorable version of the pattern
mixture model.
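
A minimal sketch of how G, P and the deviance could be estimated from posterior predictive draws is given below. It uses the weighting written in the note to Table 1; the function name, the data layout and the use of the population variance of the draws are assumptions made for illustration.

```python
from statistics import mean, pvariance

def deviance_measure(observed, predictive_draws, k=100):
    """Return (G, P, D) from posterior predictive draws.

    observed[i] is an observed value and predictive_draws[i] is a list of
    posterior predictive draws for the same individual. G sums squared
    differences between predictive means and observations; P sums the
    predictive variances; D applies the weighting in the Table 1 note.
    """
    G = sum((mean(d) - x) ** 2 for x, d in zip(observed, predictive_draws))
    P = sum(pvariance(d) for d in predictive_draws)
    D = G + (k / (k + 1)) * P
    return G, P, D
```

With k = 100 the factor k/(k + 1) is about 0.99, so the measure is close to the plain sum of the fit and penalty terms.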


Next, we look for deficiencies in the selection model. We 
use a Bayesian cross-validation analysis to assess the 
goodness of fit of the selected model (i.e., the selection
model). We do so by using deleted residuals on the 
respondents’ BMI values. 

Let $(x_{(ij)}, r_{(ij)})$ denote the vector of all observations
excluding the $(ij)$th observation $(x_{ij}, r_{ij})$. Then, the $(ij)$th
deleted residual is given by

$$\text{DRES}_{ij} = \{x_{ij} - E(x_{ij} \mid x_{(ij)}, r_{(ij)})\} / \text{STD}(x_{ij} \mid x_{(ij)}, r_{(ij)}).$$
These values are obtained by performing a weighted
importance sampling on the Metropolis-Hastings output.
The posterior moments are obtained from

$$f(x_{ij} \mid x_{(ij)}, r_{(ij)}) = \int f(x_{ij} \mid \Omega)\, \pi(\Omega \mid x_{(ij)}, r_{(ij)})\, d\Omega.$$

For the pattern mixture model

$$f(x_{ij} \mid \Omega) = f(x_{ij} \mid r_{ij} = 0, \Omega)\, p(r_{ij} = 0 \mid \Omega) + f(x_{ij} \mid r_{ij} = 1, \Omega)\, p(r_{ij} = 1 \mid \Omega),$$

and for the selection model

$$f(x_{ij} \mid \Omega) \sim \text{Normal}\{(z_{ij}'\alpha_1 + \nu_{0i}) + (z_{ij}'\alpha_2 + \nu_{1i})\, a_{ij},\ \sigma_3^2\}.$$

We also considered using the conditional predictive
ordinate (CPO), which is $f(x_{ij} \mid x_{(ij)}, r_{(ij)})$ evaluated at the
observed $x_{ij}$. However, these CPOs lead to similar results
for identifying extremes.

We drew box plots (not shown) of DRES versus the four
levels of race-sex and the thirty-five counties, and they
showed that the selection model fits well. We drew box
plots of DRES versus age and, interestingly, we found a
pattern. Age class 2-4 seems to fit well; the predicted BMI
values are somewhat high for age class 5-9; and age classes
10-14 and 15-19 have larger variability. We examine the
box plots of DRES versus age further by separating
out the box plots for the 18 individual ages (i.e., 2-19 years
old; see Figure 1). Ages 11-19 fit well, but there is a problem
with ages 2-10 (i.e., a downward curvature in the medians).
The other three models show similar patterns. A further
refinement of the selection model in section 5 fixes this
problem.


4. Estimation and Prediction 


In this section we perform an analysis on the NHANES
III BMI data for children and adolescents (i.e., 2-19 years
old). We use the selection model, and then as a means to
study sensitivity, we compare prediction under the non- 
ignorable nonresponse selection model with that of the other 
three models. 


4.1 Estimation 


We have studied the relation between BMI and age using
95% credible intervals for the parameters in the selection
model. First, the interaction of race and sex is not important,
but as expected there is an important relation between BMI and
age. BMI increases substantially with age (95% credible
interval for $\alpha_{11}$ is (11.89, 13.67)). The rate of increase for
white males is smaller (95% credible interval for $\alpha_{12}$ is
(-2.30, -0.19) and the 95% credible interval for $\alpha_{13}$ is
(-3.03, -0.64)). Thus, while BMI increases with age,
there is relatively less increase for white males. Apart from
the parameter $\theta_1$, which indicates strong nonignorability,
the other parameters are essentially unimportant. For
example, the 95% credible intervals for $\rho_1$ and $\rho_2$ are
(-0.53, 0.39) and (-0.45, 0.45) respectively, indicating
that a simpler model can be used (i.e., $\rho_1 = \rho_2 = 0$).

We take up the issue of ignorability further. We drew 
box plots (not shown) of the posterior densities of the $\beta_{1i}$,
obtained from the iterates from the Metropolis-Hastings 
sampler, by county. All the box plots are above zero. This 
suggests that the nonresponse mechanism for each county is 
nonignorable. In addition, there are varying degrees of 
nonignorability. For example, several counties have the 
medians of the box plots near 1.5 while others have them 
near 2. 


4.2 Prediction 


It is desirable to predict the finite population mean BMI 
value and the proportion of respondents in the finite 
population. The sampled nonrespondents’ BMI values are 
obtained through their conditional posterior densities 
included in the Metropolis-Hastings sampler. The non- 
sampled BMI values are to be predicted. 

It is worthwhile noting that our models are applied to the
logarithm of BMI with each individual having her/his
covariates, and so the logarithm of each individual nonsampled
value has to be predicted and then retransformed to
the original scale. However, the computation is reduced
considerably because, although age, race and sex are not known
for each nonsampled individual, the number of individuals in
each age-race-sex domain is known in the U.S. population
by county.

The distributions of the nonsampled individuals are

$$f(x_{ij}, r_{ij} \mid x^{(s,r)}, r) = \int f(x_{ij}, r_{ij} \mid \Omega)\, \pi(\Omega \mid x^{(s,r)}, r)\, d\Omega,$$

$i = 1, \ldots, \ell$, $j = n_i + 1, \ldots, N_i$. For the pattern mixture model
we have

$$f(x_{ij}, r_{ij} \mid \Omega) = f(x_{ij} \mid r_{ij}, \Omega)\, p(r_{ij} \mid \Omega)$$

and for the selection model we have

$$f(x_{ij}, r_{ij} \mid \Omega) = p(r_{ij} \mid x_{ij}, \Omega)\, f(x_{ij} \mid \Omega),$$

where $\Omega$ denotes the set of all parameters.
Therefore, if we take a sample of size $M$ from the
posterior distribution, $\{\Omega^{(h)} : h = 1, \ldots, M\}$, an estimator for
$f(x_{ij}, r_{ij} \mid x^{(s,r)}, r)$ is

$$\hat{f}(x_{ij}, r_{ij} \mid x^{(s,r)}, r) = \frac{1}{M} \sum_{h=1}^{M} f(x_{ij}, r_{ij} \mid \Omega^{(h)}).$$

Thus, we can fill in the $x_{ij}$ and $r_{ij}$ for each $\Omega^{(h)}$ obtained
from the MCMC algorithm, from which we get $M$
realizations $\bar{X}_i^{(h)}, P_i^{(h)}$, $h = 1, \ldots, M$. Inference can now be
made about $\bar{X}_i$ in (1) and $P_i$ in (2).

Figure 1. Box plots of the cross-validation residuals (DRES) by age for the selection model
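
The fill-in step for the finite population mean described in section 4.2 can be sketched as follows. This is a simplified illustration under stated assumptions: each posterior draw is represented by a dictionary holding the sampled BMI values on the original scale (observed for respondents, imputed for nonrespondents), a mean log BMI for each domain and a residual standard deviation. The key names and the function name are hypothetical.

```python
import math
import random

def finite_population_mean_draws(posterior_draws, nonsampled_counts, seed=0):
    """One realization of the finite population mean BMI per posterior draw.

    posterior_draws: list of dicts with keys
        "sampled_bmi"  - BMI values of sampled individuals (original scale)
        "mean_log_bmi" - dict mapping domain -> mean of log BMI
        "sigma"        - residual standard deviation on the log scale
    nonsampled_counts: dict mapping domain -> number of nonsampled individuals
    """
    rng = random.Random(seed)
    realizations = []
    for draw in posterior_draws:
        total = sum(draw["sampled_bmi"])
        count = len(draw["sampled_bmi"])
        for domain, size in nonsampled_counts.items():
            mu = draw["mean_log_bmi"][domain]
            for _ in range(size):
                # Predict log BMI, then retransform to the original scale.
                total += math.exp(rng.gauss(mu, draw["sigma"]))
            count += size
        realizations.append(total / count)
    return realizations
```

A 95% credible interval could then be read off as the 2.5th and 97.5th percentiles of the returned realizations. Only the domain counts of nonsampled individuals are needed, which is why knowing the age-race-sex totals by county is sufficient.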
We present 95% credible intervals for the finite population
mean (FPM) BMI value and the finite population
proportion (FPP) responding in order to judge sensitivity to 
the four models. Note that we provide these intervals for 
each domain: race by sex for each age class by county, and 
because they are very similar across domains we have 
presented in Table 2 the average of the end points of the 
credible intervals over county for black females only. The 
intervals for the FPM across the models are very similar. 
However, those for the FPP are very different. The intervals 
for the pattern mixture model and its ignorable version are 
similar except for age class 2-4. This is expected because 
these models express a linear regression of the logarithm of 
the odds of responding on age. The intervals for the FPP 
under the two pattern mixture models are essentially the 
same because they have the same relation with age, race, 
sex and their interaction. The intervals for the ignorable 
version of the selection model are all the same over age 
because in the response part of this model both age and BMI 
are ignored. We note that the intervals for the selection 
model have forms similar to the pattern mixture model and 
its ignorable version. As the intervals indicate, the FPM and 
FPP increase with age. 


5. A Spline Regression Model 


We now address the issue associated with the box plot in 
Figure 1. We have a further look at the observed data. A box 
plot of observed BMI values versus age shows that BMI is 
roughly constant for ages 2-8, then rises roughly linearly 
for ages 8-13, and finally rises very slowly for ages 14-19. 
This apparently important feature is not included in the four 
models. Thus, in this section we attempt to exploit this 
feature using a spline regression model. 

We have used Part 1 of the selection model, and for Part
2 we use a join-point regression model. Generically, letting
$c^+ = 0$ if $c < 0$ and $c^+ = c$ if $c \ge 0$, we take

$$x_{ij} = \alpha_{0ij} + \alpha_{1ij}(a_{ij} - 8)^+ + \alpha_{2ij}(a_{ij} - 13)^+ + e_{ij}, \qquad (9)$$

where in the spirit of our four models

$$\alpha_{kij} = z_{ij}'\alpha_k + \nu_{ki}, \qquad k = 0, 1, 2.$$

In (9) we have taken

$$e_{ij} \mid \sigma_3^2 \stackrel{iid}{\sim} \text{Normal}(0, \sigma_3^2)$$

and, motivated by our earlier result (the $\nu_{ki}$ are
uncorrelated), rather than a trivariate normal density on
$\nu_i = (\nu_{0i}, \nu_{1i}, \nu_{2i})'$, we have taken

$$\nu_{ki} \mid \sigma_{k+4}^2 \stackrel{iid}{\sim} \text{Normal}(0, \sigma_{k+4}^2), \qquad k = 0, 1, 2.$$

The distribution assumptions on the hyper-parameters
remain unchanged.
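
The join-point mean in (9) can be sketched with the positive-part function. The knots at ages 8 and 13 come from the text; the function names and the coefficient values in the usage note are illustrative assumptions.

```python
def plus(c):
    """Positive part: c if c > 0, otherwise 0."""
    return c if c > 0 else 0.0

def spline_mean(age, a0, a1, a2, knots=(8, 13)):
    """Mean of log BMI under the linear join-point model (9).

    a0, a1 and a2 stand for the individual-level coefficients
    (regression part plus county random effect) for the level and the
    two changes of slope.
    """
    return a0 + a1 * plus(age - knots[0]) + a2 * plus(age - knots[1])
```

With `a0 = 2.7`, `a1 = 0.05` and `a2 = -0.03`, the mean is flat at 2.7 up to age 8, rises by 0.05 per year between ages 8 and 13, and rises by only 0.02 per year after age 13, which mirrors the three regimes seen in the observed data.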

We have computed the deviance measure for the spline
model; see Table 1 for the other four models. For this model
G = 129 and P = 107 compared with G = 118 and
P = 135 for the selection model. That is, $D_{100} \approx 236$ for
the spline regression model and $D_{100} = 253$ for the
selection model. Thus, the spline regression model shows an
improvement over the original selection model.

In Figure 2 we present box plots of DRES versus age.
This is a much improved plot over the one for the selection
model (see Figure 1). Observe that the medians fluctuate
about 0 with very little variation. The box plots for ages 2, 3,
4, 5, 6 and 7 are a little less variable than the others. We also
fit the quadratic join-point model in which we replace (9) by

$$x_{ij} = \alpha_{0ij} + \alpha_{1ij}(a_{ij} - 8)^+ + \alpha_{2ij}\{(a_{ij} - 13)^+\}^2 + e_{ij},$$

with all other assumptions remaining unchanged. This
model did not show any substantial improvement over the
alternative model specified by (9), which we retain without
further refinement.


Table 2
Comparison of the Four Models Based on the Average Over All Counties of the End Points of the 95%
Credible Intervals for the Finite Population Mean BMI (FPM) and Proportion (FPP) Responding for Black Females

                                           Age
Model        2-4              5-9              10-14            15-19
SEI   FPM    (14.80, 16.07)   (17.09, 18.58)   (19.63, 21.61)   (22.40, 25.19)
      FPP    (0.73, 0.79)     (0.73, 0.79)     (0.73, 0.79)     (0.73, 0.79)
SE    FPM    (15.55, 16.21)   (17.49, 18.36)   (19.52, 20.92)   (21.74, 23.91)
      FPP    (0.66, 0.78)     (0.71, 0.81)     (0.75, 0.84)     (0.78, 0.87)
PMI   FPM    (14.75, 16.10)   (17.04, 18.59)   (19.59, 21.55)   (22.42, 25.09)
      FPP    (0.49, 0.70)     (0.72, 0.84)     (0.84, 0.94)     (0.90, 0.98)
PM    FPM    (14.96, 15.79)   (17.16, 18.38)   (19.61, 21.45)   (22.37, 25.07)
      FPP    (0.49, 0.70)     (0.73, 0.84)     (0.84, 0.94)     (0.90, 0.98)

Note: SEI is the ignorable version of the selection model, PMI is the ignorable version of the pattern mixture model,
PM is the pattern mixture model, and SE is the selection model.

Figure 2. Box plots of the cross-validation residuals (DRES) by
age for the spline regression model


In Table 3 we compare the FPM for the selection models
(regression without splines and regression with splines).
Again we average the end points of the 95% credible
intervals over all counties. The intervals overlap, suggesting
similarity between the model without splines and the one
with them. However, there are some exceptions. The largest
difference between the intervals occurs for individuals aged
15-19 years. In general, the spline model provides
higher precision. For example, for ages 10-19 the intervals
for the spline model are contained in those for the model
without the splines.


6. Conclusions 


To analyze BMI data from NHANES III by age, race and 
sex within each county, (a) we have extended the normal- 
logistic regression model to a hierarchical Bayesian 
selection model, and (b) constructed a pattern mixture 
model and two ignorable nonresponse models to assess 
sensitivity to inference. A deviance measure shows that 
among the four models, the selection model is the best, and 
a cross-validation analysis shows that these models fit 
roughly equally well. 


Table 3
Comparison of the Two Selection Models (Regression Without Splines and Regression with Splines) Using the Average
Over All Counties of the End Points of the 95% Credible Intervals for the Finite Population Mean BMI by Age, Race and Sex

                                              Age
R-S   Model        2-4              5-9              10-14            15-19
BF    No Spline    (16.26, 16.92)   (16.44, 17.10)   (19.62, 21.41)   (21.35, 25.62)
      Spline       (15.65, 16.31)   (17.62, 18.41)   (19.70, 20.91)   (21.95, 23.82)
BM    No Spline    (16.10, 16.76)   (16.26, 16.92)   (18.83, 20.55)   (20.45, 24.53)
      Spline       (15.68, 16.32)   (17.32, 18.41)   (19.03, 20.21)   (20.84, 22.61)
OF    No Spline    (16.39, 17.00)   (16.56, 17.17)   (19.48, 21.19)   (21.16, 25.39)
      Spline       (16.01, 16.60)   (17.77, 18.54)   (19.62, 20.79)   (21.61, 23.38)
OM    No Spline    (16.53, 17.14)   (16.67, 17.29)   (19.22, 20.95)   (20.83, 24.98)
      Spline       (16.16, 16.74)   (17.74, 18.51)   (19.38, 20.59)   (20.93, 22.6)

Note: R-S is race-sex: BF is black female; BM is black male; OF is non-black female; and OM is non-black male.
Another contribution is the identification of a common
deficiency in the selection model, the pattern mixture model 
and the two ignorable models. Based on the observed data, 
we have found that there is a dynamic relationship of BMI 
with age. Thus, we have further extended the selection 
model to include three linear splines. The cross validation 
analysis shows that there is an improvement over the 
selection model, and in fact, the deviance measure shows 
that the linear spline regression model is the best among the 
five models. 

Our study on obesity is one of the key contributions in
this work. The linear spline regression of BMI on age,
adjusting for race and sex, gives a better fit and higher
precision than the selection model without splines. It is not
easy to construct a model that is satisfactory for all aspects
of the NHANES III data simultaneously. We have been able
to do so for children and adolescents. BMI increases
substantially with age; race and sex contribute negatively
to this increase, so that there is relatively less increase for white
males. In general, the effects of race and sex are relatively
minor. There is some variation across the thirty-five
counties.


Appendix A 
The Pattern Mixture Model 


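For Part 1 of the pattern mixture model the response
depends on age, race and sex, and the interaction of race and
sex through the logistic regression

$$r_{ij} \mid \beta_i \stackrel{ind}{\sim} \text{Bernoulli}\left\{\frac{e^{\beta_{0i} + \beta_{1i} a_{ij} + \beta_{2i} z_{ij1} + \beta_{3i} z_{ij2} + \beta_{4i} z_{ij3}}}{1 + e^{\beta_{0i} + \beta_{1i} a_{ij} + \beta_{2i} z_{ij1} + \beta_{3i} z_{ij2} + \beta_{4i} z_{ij3}}}\right\}, \qquad (A.1)$$

$i = 1, \ldots, \ell$, $j = 1, \ldots, N_i$. Now, letting $\beta_i = (\beta_{0i}, \beta_{1i}, \beta_{2i}, \beta_{3i}, \beta_{4i})'$,
note that while the vector $\beta_i$ has
$p = 5$ components, the corresponding vector in (4) has two
components. Analogous to (4) we take

$$\beta_i \mid \theta, \Delta \stackrel{iid}{\sim} \text{Normal}(\theta, \Delta), \qquad (A.2)$$

and for the prior distribution,

$$\theta \sim \text{Normal}(\theta^{(0)}, \Delta^{(0)}) \quad \text{and} \quad \Delta^{-1} \sim \text{Wishart}\{(\nu^{(0)} \Lambda^{(0)})^{-1}, \nu^{(0)}\}, \quad \nu^{(0)} \ge p, \qquad (A.3)$$

where $\theta^{(0)}$, $\Delta^{(0)}$, $\Lambda^{(0)}$ and $\nu^{(0)}$ are to be specified. Part 2 of
this model for BMI incorporates a dependence on the
response indicators. Letting $w_{ij0} = 1$ and $w_{ij1} = a_{ij}$,

$$x_{ij} \mid r_{ij} = (z_{ij}'\alpha_1 + r_{ij}\,\nu_{0i})\, w_{ij0} + (z_{ij}'\alpha_2 + \nu_{1i})\, w_{ij1} + e_{ij}, \qquad r_{ij} = 0, 1,$$

$$e_{ij} \mid \sigma_3^2 \stackrel{iid}{\sim} \text{Normal}(0, \sigma_3^2). \qquad (A.4)$$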
The distributions on the $(\nu_{0i}, \nu_{1i})$ are the same as in (7).
The prior distributions are exactly those in Part 2 of the
selection model (i.e., see (6) and (7)).

We take $\nu^{(0)} = 2p$, a value that indicates near
vagueness, maintains propriety and permits stability in
computation. We show how to specify parameters like
$\theta^{(0)}$, $\Delta^{(0)}$, $\alpha_k^{(0)}$, $\Delta_k^{(0)}$, $k = 1, 2$, and $\Lambda^{(0)}$ in Appendix C.


Appendix B 
Metropolis-Hastings Algorithm for Fitting the 
Selection Model 


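For the nonignorable nonresponse selection model the
joint posterior density is

$$
\begin{aligned}
p(x^{(s,nr)}, \sigma^2, \alpha, \beta, \nu, \theta, \rho_1, \rho_2 \mid x^{(s,r)}) \propto{}
& \prod_{i=1}^{\ell} \prod_{j=1}^{r_i} \frac{1}{\sigma_3}\, e^{-\frac{1}{2\sigma_3^2}\left[x_{ij} - \{z_{ij}'(\alpha_1 + a_{ij}\alpha_2) + \nu_{0i} + \nu_{1i} a_{ij}\}\right]^2} \frac{e^{\beta_{0i} + \beta_{1i} x_{ij}}}{1 + e^{\beta_{0i} + \beta_{1i} x_{ij}}} \\
& \times \prod_{i=1}^{\ell} \prod_{j=r_i+1}^{n_i} \frac{1}{\sigma_3}\, e^{-\frac{1}{2\sigma_3^2}\left[x_{ij} - \{z_{ij}'(\alpha_1 + a_{ij}\alpha_2) + \nu_{0i} + \nu_{1i} a_{ij}\}\right]^2} \frac{1}{1 + e^{\beta_{0i} + \beta_{1i} x_{ij}}} \\
& \times \prod_{i=1}^{\ell} \frac{1}{\sigma_1 \sigma_2 \sqrt{1 - \rho_1^2}}\, e^{-\frac{1}{2(1-\rho_1^2)}\left\{\left(\frac{\beta_{0i} - \theta_0}{\sigma_1}\right)^2 - 2\rho_1\left(\frac{\beta_{0i} - \theta_0}{\sigma_1}\right)\left(\frac{\beta_{1i} - \theta_1}{\sigma_2}\right) + \left(\frac{\beta_{1i} - \theta_1}{\sigma_2}\right)^2\right\}} \\
& \times \prod_{i=1}^{\ell} \frac{1}{\sigma_4 \sigma_5 \sqrt{1 - \rho_2^2}}\, e^{-\frac{1}{2(1-\rho_2^2)}\left\{\left(\frac{\nu_{0i}}{\sigma_4}\right)^2 - 2\rho_2\left(\frac{\nu_{0i}}{\sigma_4}\right)\left(\frac{\nu_{1i}}{\sigma_5}\right) + \left(\frac{\nu_{1i}}{\sigma_5}\right)^2\right\}} \\
& \times \prod_{k=1}^{5} \left(\frac{1}{\sigma_k^2}\right)^{a/2 + 1} e^{-\frac{a}{2\sigma_k^2}} \times e^{-\frac{1}{2}(\theta - \theta^{(0)})'(\Delta^{(0)})^{-1}(\theta - \theta^{(0)})} \\
& \times \prod_{k=1}^{2} e^{-\frac{1}{2}(\alpha_k - \alpha_k^{(0)})'(\Delta_k^{(0)})^{-1}(\alpha_k - \alpha_k^{(0)})}. \qquad (B.1)
\end{aligned}
$$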


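Let $\Omega$ denote the set of parameters $\beta$, $\theta$, $\nu$, $\alpha$,
$\sigma_3^2$, $\psi_1$, $\psi_2$ and $x^{(s,nr)}$, where $\psi_1 = (\sigma_1^2, \sigma_2^2, \rho_1)$ and
$\psi_2 = (\sigma_4^2, \sigma_5^2, \rho_2)$. Generically, let $\Omega_a$ denote all
parameters in $\Omega$ except $a$; for example, $\Omega_\beta =
(\theta, \nu, \alpha, \sigma_3^2, \psi_1, \psi_2, x^{(s,nr)})$, so that the conditional
posterior density (CPD) of $\beta$ is denoted by
$p(\beta \mid \Omega_\beta, x^{(s,r)})$. To perform the Metropolis-Hastings
algorithm, one needs the CPD for each parameter given the
others and $x^{(s,r)}$. Here we give a sketch of the algorithm.

The CPD for each of the parameters $\theta$, $\nu$, $\alpha$ and $\sigma_3^2$ is
easy to write down. But we need Metropolis steps for the
CPDs of $\beta$, $\psi_1$, $\psi_2$ and $x^{(s,nr)}$.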
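Conditioning on $\Omega_\beta$, the parameters $\beta_1, \ldots, \beta_\ell$ are
independent with

$$p(\beta_i \mid \Omega_\beta, x^{(s,r)}) \propto \left\{\prod_{j=1}^{n_i} \frac{e^{(\beta_{0i} + \beta_{1i} x_{ij})\, r_{ij}}}{1 + e^{\beta_{0i} + \beta_{1i} x_{ij}}}\right\} e^{-\frac{1}{2}(\beta_i - \theta)' \Delta^{-1} (\beta_i - \theta)},$$

where

$$\Delta = \begin{pmatrix} \sigma_1^2 & \rho_1 \sigma_1 \sigma_2 \\ \rho_1 \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}$$

and the $x_{ij}$, $i = 1, \ldots, \ell$ and $j = r_i + 1, \ldots, n_i$, are to be predicted;
see below. We use a technique based on logistic regression
to obtain a multivariate Student’s t proposal density in
which tuning is obtained by varying its degrees of freedom.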

The method to draw from the CPD’s of yw, = 
(o7,65,p,) and yw, =(63, 0%, 1,) is the same. The CPD 
of y, is 


a oe 
1 2 Dh oe 
G16: 

495 


1 1 we 29) u 1 Y 
1 ae pane Yi for ie vorit—z Mh 
2(1-p3) [04 i=1 i=l 05 i=1 


“ ec oye a 
2 


We have used Fisher's z transformation (see Ruben 
1966) to obtain a proposal density associated with a normal 
distribution for $\log\{\rho_1/(1-\rho_1)\}$ and gamma distributions 
for $\sigma_1^2$ and $\sigma_2^2$. 

Finally, we consider the Metropolis step for drawing 
xeon) LQteee a We note that in this CPD, 
x;,i=1,...,1, j=7,+1,...,n,, are independent with 


P(W, |, sa: ) ce [ 


ij? 


1 2 
— Ly {2 (@ +445 +19; 44,4; }] 
(s,r) Dhow, e 
DG, |Qy, x )oce 


Kee a! 
{1+ fot i} ; 


We have constructed a proposal density using least squares 
techniques. We note that the proposal density 
Normal (z,,(@, + 4,@)) +o; +¥,4;,03) did not perform 
well (see Chib and Greenberg 1995). 


Appendix C 
Specification of Hyperparameters 


We discuss how to specify the hyperparameters 
(0, AO) and (a\”, Gee ),k =1, 2, associated with @ and 
a,,k =1, 2 in the selection model. 

First, consider 0, A), | 50 aie Rw ey ey a Bean | 
fit the logistic regression model 1, “ Bernoulli 
{ePo Pury Cy 4 ePo*Bity 1) Where x, are obtained by 
prediction (see Appendix A). Letting B,,i=1,...,/ denote 
the least squares estimators, we assume that 
ew ek ~ A 
B, ~ Normal (6, A’) to get 0 =1/1 >, B, and 


ES i gh i x 
Ani pad > (B; —O)) (B; — 8) (C.1) 
i=] 


and we set $\Delta = \kappa_1 \Delta^{(0)}$, where $\kappa_1$ is to be selected. 
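
The moment step behind (C.1) can be sketched as follows. The function name, the example coefficient vectors and the divisor $I-1$ in the covariance are assumptions for illustration (the divisor is not legible in the source); only the idea of averaging the fitted coefficient vectors and inflating their sample covariance by a factor $\kappa_1$ comes from the text.

```python
def moment_hyperparameters(beta_hats, kappa):
    """Mean vector, sample covariance (divisor I - 1, assumed) and inflated covariance."""
    count = len(beta_hats)
    dim = len(beta_hats[0])
    theta0 = [sum(b[k] for b in beta_hats) / count for k in range(dim)]
    delta0 = [[sum((b[r] - theta0[r]) * (b[c] - theta0[c]) for b in beta_hats)
               / (count - 1)
               for c in range(dim)]
              for r in range(dim)]
    delta = [[kappa * v for v in row] for row in delta0]
    return theta0, delta0, delta


# Hypothetical fitted coefficient vectors for three areas.
theta0, delta0, delta = moment_hyperparameters(
    [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)], kappa=100.0)
```

A large `kappa` spreads the prior while keeping it proper, which is the role the inflation factor plays in the text.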

Next, we consider how to specify (a\”, T°), k =1, 2. 
We fit x, =z; (@,+@,a,)+e,,, where a, is the age of the 
j" individual in the i" county, i=1,...,/, j=1,...,n, to 
get least squares estimators, &@ = (@,, &,) and its covariance 
matrix I. We set a =4,, and KO =«, 1%, where 
he, k=1,2 is the corresponding block matrix of 
1, k =1,2 and k, is to be specified. 

We have experimented with $\kappa_1$ in (C.1). We used 
$\kappa_1 = 100$ to provide a proper diffuse prior; a value of 
$\kappa_1 = 1{,}000$ did not change our predictions. Similarly, we 
used $\kappa_2 = 100$. 
Towards Nonnegative Regression Weights for Survey Samples 


Mingue Park and Wayne A. Fuller 


Abstract 


Procedures for constructing vectors of nonnegative regression weights are considered. A vector of regression weights in 
which initial weights are the inverse of the approximate conditional inclusion probabilities is introduced. Through a 
simulation study, the weighted regression weights, quadratic programming weights, raking ratio weights, weights from the logit 
procedure, and weights of a likelihood-type are compared. 


Key Words: Raking ratio; Maximum likelihood; Quadratic programming; Simple Conditionally Weighted (SCW) estimator. 


1. Introduction 


In survey sampling, information about the population is 
often available at the analysis stage. One method of using 
this information is through regression estimation. There are 
a number of ways to construct a regression estimator of the 
population mean or total. One regression estimator of the 
mean is 


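$$\bar{y}_{reg} = \sum_{i=1}^{n} w_i y_i = \bar{y}_{\pi} + (\bar{x}_N - \bar{x}_{\pi})\hat{\beta}, \tag{1}$$

where

$$w_i = \alpha_i + (\bar{x}_N - \bar{x}_{\pi})\left(\sum_{j=1}^{n} x_j' \phi_j^{-1} x_j\right)^{-1} x_i' \phi_i^{-1}, \tag{2}$$

$$(\bar{y}_{\pi}, \bar{x}_{\pi}) = \left(\sum_{i=1}^{n} \pi_i^{-1}\right)^{-1} \sum_{i=1}^{n} \pi_i^{-1} (y_i, x_i) = \sum_{i=1}^{n} \alpha_i (y_i, x_i),$$

$$\hat{\beta} = \left(\sum_{i=1}^{n} x_i' \phi_i^{-1} x_i\right)^{-1} \sum_{i=1}^{n} x_i' \phi_i^{-1} y_i,$$

$\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_n)$ is a nonsingular diagonal matrix, 
the $\pi_i$'s are the selection probabilities and $\bar{x}_N$ is the 
population mean of $x$. A possible choice of $\phi_i$ is $\alpha_i^{-1}$. A 
review of the use of such information in regression estimation 
for sample surveys is given by Fuller (2002). 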
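
For a simple random sample with one auxiliary variable and an intercept, the weights in (2) reduce to $w_i = n^{-1} + (\bar{x}_N - \bar{x})(x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2$. The sketch below computes them; the function name and the five data values are made up for illustration, and equal initial weights $\alpha_i = n^{-1}$ are assumed.

```python
def regression_weights(x, xbar_pop):
    """Regression weights for x_i = (1, x_i) with equal initial weights 1/n."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    shift = xbar_pop - xbar
    return [1.0 / n + shift * (xi - xbar) / sxx for xi in x]


# Hypothetical sample whose mean (-0.5) is far below the population mean (0.5).
x = [-2.0, -1.0, -0.5, 0.0, 1.0]
w = regression_weights(x, xbar_pop=0.5)
```

The weights sum to one and reproduce the population mean of $x$, but the element farthest from the population mean receives a negative weight, which is the behaviour the paper sets out to avoid.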

It is well known that regression weights that are used to 
define a regression estimator such as (2) can be very large, 
negative, or both. If the regression weights are to be 
used to estimate a finite population total in a general purpose 
survey, it seems reasonable that no individual weight 
should be less than one. Also, it seems reasonable, on 
robustness grounds, to avoid very large weights. 

There are several ways to construct regression weights 
with a reduced range of values. Huang and Fuller (1978) 
defined a procedure to modify the $w_i$ so that there are no 
negative weights and no large weights. Husain (1969) suggested 
quadratic programming as a procedure to place 
bounds on the weights. Quadratic programming and a 
number of other procedures build on the fact that the 
weights can be defined as values that optimize some 
function. Deville and Särndal (1992) considered seven 
objective functions that can be used to construct weights. 
They suggested objective functions that can be used to 
produce weights which fall within a given range. Deville, 
Särndal and Sautory (1993) introduced the program 
CALMAR, written as a SAS macro, that can be used to 
calculate weights corresponding to four different objective 
functions when auxiliary information in the survey consists 
of known marginal counts in a frequency table. 

Another modification of regression weights is to relax 
some of the restrictions used in constructing the estimator. 
Husain (1969) considered modifying weights for a simple 
random sample from a normal distribution. He derived the 
weights that minimize the mean square error (MSE) of the 
resulting estimator. Bardsley and Chambers (1984) con- 
sidered an estimator based on an objective function and the 
division of the auxiliary variable into two components. They 
studied the behavior of the estimator from a model 
perspective. Rao and Singh (1997) studied an estimator in 
which tolerances are given for the difference between the 
final estimator for part of the auxiliary variables vector and 
the corresponding elements of the population vector. 

In this paper, we consider different types of regression 
weights including a procedure based on Tillé's (1998) conditional 
selection probabilities. The approximate conditional 
inclusion probabilities are used to compute regression 
weights that are positive for most samples. These regression 
weights are compared to raking ratio weights, to quadratic 
programming weights, to weights from the logit procedure, and to 
weights based on a likelihood-type objective function. 


2. Maximum Likelihood and 
Raking Ratio 


Consider a two-way table with $r$ rows and $c$ columns. 
The population cell $U_{ij}$ contains $N_{ij}$ elements; $i = 1, \ldots, r$, 
$j = 1, \ldots, c$. Assume marginal counts $N_{i\cdot}$, $N_{\cdot j}$ are 
known. The population characteristics of interest are the 
$N_{ij}$ or, equivalently, $p_{ij} = N_{ij}/N_{\cdot\cdot}$. For a simple random 
nonreplacement sample of size $n$, Deming and Stephan 
(1940) suggested a raking ratio procedure to get the solution 
for the cell frequencies. See also Stephan (1942). If we 
assume the sample is a random sample from a multinomial 
distribution defined by the population entries in a two-way 
table, we can construct an estimator using the maximum 
likelihood procedure. 

Deville and Särndal (1992) defined a class of calibration 
estimators, $\bar{y}_{cal}$, of the population mean of $y$ as 

$$\bar{y}_{cal} = \sum_{i=1}^{n} w_i y_i, \tag{3}$$

where the $w_i$'s minimize the objective function 
$\sum_{i=1}^{n} G(w_i, \alpha_i)$ subject to the constraints 

$$\sum_{i=1}^{n} w_i = 1 \quad \text{and} \quad \sum_{i=1}^{n} w_i x_i = \bar{x}_N, \tag{4}$$

and $G(w_i, \alpha_i)$ is a measure of distance between an initial 
weight $\alpha_i$ and a final weight $w_i$. The raking ratio and 
maximum likelihood estimators of the population cell 
fraction, $p_{ij}$, belong to the class of calibration estimators. 

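The raking ratio weights for the population cell fraction, 
with a simple random sample, can be obtained by 
minimizing 

$$\sum_{k=1}^{n} \left[ w_k \log\left(\frac{w_k}{n^{-1}}\right) - w_k + n^{-1} \right], \tag{5}$$

subject to the constraints (4) with 

$$x_k = (\delta_{1\cdot k}, \ldots, \delta_{r\cdot k}, \delta_{\cdot 1 k}, \ldots, \delta_{\cdot c k}), \tag{6}$$

where $\delta_{i\cdot k} = 1$ if the $k$th element belongs to the $i$th row and 
$\delta_{i\cdot k} = 0$ otherwise, and $\delta_{\cdot j k} = 1$ if the $k$th element belongs to the 
$j$th column and $\delta_{\cdot j k} = 0$ otherwise. The raking ratio 
estimator for the population cell fraction $p_{ij}$ is the estimator 
(3) where $y_k = 1$ if the $k$th element belongs to cell $ij$ and 
$y_k = 0$ otherwise. 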
For the maximum likelihood estimator of the population 
fraction, with a simple random sample, Deville and Särndal 
(1992) suggested minimizing 

$$\sum_{k=1}^{n} \left[ n^{-1} \log\left(\frac{n^{-1}}{w_k}\right) + w_k - n^{-1} \right], \tag{7}$$

subject to (4) with $x_k$ defined in (6). 

Chen and Sitter (1999) suggested a pseudo empirical 
likelihood estimator. They defined the population likelihood 
of $y_i$ as 

$$\sum_{i=1}^{N} \log w_{iN}, \tag{8}$$

where $w_{iN}$ is the density at observation $y_i$. With a sample 
of size $n$, they suggested the pseudo empirical likelihood 
estimator of the form 

$$\bar{y}_{PEL} = \sum_{i=1}^{n} w_i y_i, \tag{9}$$

where the $w_i$'s are obtained by minimizing the function 

$$-\sum_{i=1}^{n} \pi_i^{-1} \log w_i, \tag{10}$$

under the restrictions (4). The resulting $w_i$ are equal to 
those obtained by minimizing (7) with $\pi_i = nN^{-1}$ under the 
restrictions (4). 

Deville and Särndal (1992) showed that the raking ratio 
and maximum likelihood estimators are approximately 
equal to a regression estimator of the form (1), and, hence, 
have the same limiting distribution as the regression 
estimator. Weights for the raking ratio and maximum 
likelihood estimators are nonnegative if the solutions for the 
weights exist. 
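
To make the two objective functions concrete, the sketch below solves the calibration problem for a single auxiliary variable under equal initial weights $n^{-1}$. Under those assumptions the first-order conditions give raking weights proportional to $\exp(\lambda x_i)$ and likelihood-type weights $w_i = n^{-1}/\{1 + b(x_i - \bar{x}_N)\}$; the scalars $\lambda$ and $b$ are found by bisection. Function names and data are hypothetical, and this is an illustration of the general idea rather than the authors' computation.

```python
import math


def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def _deviations(x, xbar_pop):
    d = [xi - xbar_pop for xi in x]
    if not (min(d) < 0 < max(d)):
        raise ValueError("no solution: population mean outside the sample range")
    return d


def raking_weights(x, xbar_pop):
    """Minimize (5) subject to (4): w_i proportional to exp(lam * x_i)."""
    d = _deviations(x, xbar_pop)
    f = lambda lam: sum(di * math.exp(lam * di) for di in d)  # increasing in lam
    lo, hi = -1.0, 1.0
    while f(lo) > 0:
        lo *= 2.0
    while f(hi) < 0:
        hi *= 2.0
    lam = bisect(f, lo, hi)
    e = [math.exp(lam * di) for di in d]
    total = sum(e)
    return [ei / total for ei in e]


def mle_weights(x, xbar_pop):
    """Minimize (7) subject to (4): w_i = 1 / (n * (1 + b * d_i))."""
    n = len(x)
    d = _deviations(x, xbar_pop)
    lo, hi = -1.0 / max(d), -1.0 / min(d)  # all denominators positive inside
    pad = 1e-9 * (hi - lo)
    g = lambda b: sum(di / (1.0 + b * di) for di in d)  # decreasing in b
    b = bisect(g, lo + pad, hi - pad)
    return [1.0 / (n * (1.0 + b * di)) for di in d]


x = [-2.0, -1.0, -0.5, 0.0, 1.0]
w_rake = raking_weights(x, xbar_pop=0.5)
w_mle = mle_weights(x, xbar_pop=0.5)
```

Both sets of weights are positive by construction whenever the population mean lies strictly inside the range of the sample values, which matches the statement that the weights are nonnegative if a solution exists.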


3. Weighted Regression Using 
Conditional Probabilities 


Tillé (1998) suggested the use of approximate 
conditional inclusion probabilities, conditioning on the 
Horvitz-Thompson estimators of auxiliary variables, to 
compute an estimator for the population mean of the study 
variable. His approximation can be extended to produce 
regression weights that are nonnegative with high 
probability. 

Assume that the vector of population means of auxiliary 
variables, $\bar{x}_N$, is known. Consider the Horvitz-Thompson 
estimator of $\bar{x}_N$ given by 

$$\bar{x}_{HT} = \frac{1}{N} \sum_{i=1}^{n} \frac{x_i}{\pi_i}, \tag{11}$$

where $x_i = (x_{i1}, \ldots, x_{ip})$ and $\pi_i$ is the unconditional 
inclusion probability. Tillé (1998) introduced the simple 
conditionally weighted (SCW) estimator, 

$$\bar{y}_{SCW} = \frac{1}{N} \sum_{i=1}^{n} \frac{y_i}{\pi_{i|\bar{x}_{HT}}}, \tag{12}$$

where $\pi_{i|\bar{x}_{HT}}$ is the conditional inclusion probability of the 
$i$th element conditioning on $\bar{x}_{HT}$. To construct the SCW-estimator 
of $\bar{y}_N$, the conditional inclusion probability 
$\pi_{i|\bar{x}_{HT}}$ is required. If $\bar{x}_{HT}$ takes the value $t$, we have 

$$\pi_{i|\bar{x}_{HT}} = \pi_i \, \frac{P\{\bar{x}_{HT} = t \mid i \in A\}}{P\{\bar{x}_{HT} = t\}}, \tag{13}$$

where $A$ is the set of indices for the sample elements. 

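In order to compute the conditional inclusion probabilities, 
it is necessary to know the probability distribution of 
$\bar{x}_{HT}$ unconditionally and conditionally on the presence of 
each unit in the sample. Except for some particular cases, 
this probability distribution is very complex. For this reason, 
approximation of the conditional inclusion probability is 
considered. 

Under the assumption that $\bar{x}_{HT}$ has an approximately 
normal distribution unconditionally and conditionally on the 
presence of each unit in the sample, the conditional inclusion 
probability (13) can be approximated by 

$$\tilde{\pi}_{i|\bar{x}_{HT}} = \pi_i \, |\Sigma_{\bar{x}\bar{x}}|^{1/2} \, |\Sigma_{\bar{x}\bar{x}(i)}|^{-1/2} \exp\{0.5\,(G_{\bar{x}} - G_{\bar{x}(i)})\}, \tag{14}$$

where $\Sigma_{\bar{x}\bar{x}} = \mathrm{Var}\{\bar{x}_{HT} \mid F\}$, $\Sigma_{\bar{x}\bar{x}(i)} = \mathrm{Var}\{\bar{x}_{HT} \mid F, i \in A\}$, 

$$G_{\bar{x}} = (\bar{x}_{HT} - \bar{x}_N)\, \Sigma_{\bar{x}\bar{x}}^{-1} (\bar{x}_{HT} - \bar{x}_N)',$$

$$G_{\bar{x}(i)} = (\bar{x}_{HT} - \bar{x}_{N(i)})\, \Sigma_{\bar{x}\bar{x}(i)}^{-1} (\bar{x}_{HT} - \bar{x}_{N(i)})',$$

$$\bar{x}_{N(i)} = E\{\bar{x}_{HT} \mid F, i \in A\} = N^{-1} \sum_{j=1}^{N} \pi_{ij} (\pi_i \pi_j)^{-1} x_j,$$

$\pi_{ij}$ is the joint inclusion probability of elements $i$ and $j$ (with $\pi_{ii} = \pi_i$), 
$A$ is the set of indices appearing in the sample and 
$F = \{y_1, \ldots, y_N\}$ is the finite population. Tillé (1998) gives 
an expression for $\Sigma_{\bar{x}\bar{x}(i)}$ for the general case. 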

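Assume the design covariance matrices $\Sigma_{\bar{x}\bar{x}}$ and $\Sigma_{\bar{x}\bar{x}(i)}$ 
are positive definite and assume the vector of auxiliary 
variables is normally distributed. Tillé (1999) showed that 
the SCW-estimator defined in (12) with the approximate 
conditional inclusion probabilities of (14) satisfies 

$$\bar{y}_{SCW} = \bar{y}_{HT} + (\bar{x}_N - \bar{x}_{HT})\,\beta_N + O_p(n^{-1}) \tag{15}$$

$$\phantom{\bar{y}_{SCW}} = \bar{y}_{reg} + O_p(n^{-1}), \tag{16}$$

where 

$$\beta_N = \Sigma_{\bar{x}\bar{x}}^{-1} \Sigma_{\bar{x}\bar{y}},$$

$$\bar{y}_{reg} = \bar{y}_{HT} + (\bar{x}_N - \bar{x}_{HT})\,\hat{\beta},$$

$$\hat{\beta} = \hat{\Sigma}_{\bar{x}\bar{x}}^{-1} \hat{\Sigma}_{\bar{x}\bar{y}} = (X' \Phi^{-1} X)^{-1} X' \Phi^{-1} y,$$

$X = (x_1', \ldots, x_n')'$, $y = (y_1, \ldots, y_n)'$, the $ij$th element of 
$\Phi^{-1}$ is $N^{-2} (\pi_{ij} \pi_i \pi_j)^{-1} (\pi_{ij} - \pi_i \pi_j)$, $\Sigma_{\bar{x}\bar{x}}$ is the design 
variance of $\bar{x}_{HT}$, $\Sigma_{\bar{x}\bar{y}}$ is the design covariance of $\bar{x}_{HT}$ and 
$\bar{y}_{HT}$, $\hat{\Sigma}_{\bar{x}\bar{x}}$ is the Horvitz-Thompson variance estimator of 
$\bar{x}_{HT}$, and $\hat{\Sigma}_{\bar{x}\bar{y}}$ is the Horvitz-Thompson estimator of the 
covariance of $\bar{x}_{HT}$ and $\bar{y}_{HT}$. 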

Given a complex design, a number of the quantities in 
(14) are difficult to compute. However, approximations 
giving the same large sample properties for the estimator are 
relatively easy to compute. We replace $\Sigma_{\bar{x}\bar{x}}$ and $\Sigma_{\bar{x}\bar{x}(i)}$ 
with estimators, replace $\bar{x}_{N(i)}$ with $\bar{x}_N + d_i$, define 

$$\hat{M}_{\bar{x}\bar{y}} = \sum_{i \in A} (N \pi_i)^{-1} d_i' y_i, \tag{17}$$

and assume 

$$\mathrm{Var}\{n(\hat{M}_{\bar{x}\bar{y}} - M_{\bar{x}\bar{y}})\} = O(n^{-1}), \tag{18}$$

$$d_i = O_p(n^{-1}), \tag{19}$$

where $d_i$ is a function of the sample and $M_{\bar{x}\bar{y}}$ is a 
population quantity. Often $M_{\bar{x}\bar{y}}$ is the population covariance 
matrix $\Sigma_{\bar{x}\bar{y}}$, but this equality is not required in order 
for the estimator to be well defined. In many cases one can 
compute d, as a multiple of the jackknife deviate. Also in 
many situations, an adequate value for the estimator, 
Dee Ol ea a Ney We write our gener- 
alization of (14) as 


st A 1/2 -1/2 
Nig, — Mi |Ozx | XE, (i) 
Sale ea, 
where 
G.= = (Xyp —Xy) ye (Xie —Xy)- 
G6 = (Xyr — Xy d. ) pet (Xyr — Xp St i 
Let the estimator (12) constructed with the 71,,- of (20) be 
Y vit iN Sree Jie (21) 


i=] 


An approximate conditional inclusion probability with a 
simple random sample and a single auxiliary variable is 
where 


$$d_i = [n(N-1)]^{-1} (N - n)(x_i - \bar{x}_N),$$


wa (Naa) (=D) 9 NG =%) | 
= (0 ara a) OME Ee A aye 


C= (iN i, 
and 
5 =) DEG aks 
i 
In this case, $d_i = \bar{x}_{N(i)} - \bar{x}_N$ and $M_{\bar{x}\bar{y}} = \mathrm{Cov}(\bar{x}_{HT}, \bar{y}_{HT})$. 
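
The closed form for $d_i$ under simple random sampling can be checked by brute force: enumerate every sample of size $n$ that contains element $i$, average the sample means, and compare the shift from the population mean with $[n(N-1)]^{-1}(N-n)(x_i - \bar{x}_N)$. The tiny population and the function names below are invented for the check.

```python
from itertools import combinations


def conditional_expectation(pop, n, i):
    """E{sample mean | element i is in the sample} under simple random sampling."""
    samples = [s for s in combinations(range(len(pop)), n) if i in s]
    return sum(sum(pop[j] for j in s) / n for s in samples) / len(samples)


def d_closed_form(pop, n, i):
    """Closed-form shift (N - n)(x_i - xbar_N) / (n (N - 1))."""
    size = len(pop)
    xbar_pop = sum(pop) / size
    return (size - n) * (pop[i] - xbar_pop) / (n * (size - 1))


pop = [1.0, 4.0, 2.0, 7.0, 3.0, 9.0]  # hypothetical population of N = 6 values
n = 3
xbar_pop = sum(pop) / len(pop)
gaps = [conditional_expectation(pop, n, i) - xbar_pop - d_closed_form(pop, n, i)
        for i in range(len(pop))]
```

Every entry of `gaps` is zero up to rounding, so the enumeration and the closed form agree for this population.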


The SCW-estimator (21) with the approximate 
conditional inclusion probabilities is not calibrated, that is, 
the estimator (21) for the mean of the vector of auxiliary 
variables is not the vector of population means. It is 
relatively easy to standardize the probabilities so that they 
sum to one or sum to the stratum fraction for stratified 
sampling. To construct a calibrated estimator for the general 
case, we aueecet computing the regression estimator with 
Peer te es ]'7;);,. aS initial weights. The suggested 
estimator is 


a Loo? 2) 
i=l 
where 
(Ves x)=>. a; ();. X;) 
i=l 
a n # nh 
(B. 6: Bar) -|5 QO; Z; | ps Q; Z; »} 
i=l i=l 
Lh = (axe x) 
=j 
lie Ss Rhaa | # Tig 
w.=Q 


=i 
+ (Ky =p 0; (K,; -¥,) (x, -*) OL, (x, -¥,Y, 


and 1, is the approximate conditional inclusion 
piobabilite: of (20). We assume the vector of auxiliary 
variables contains one so that the estimator is location 
invariant. 
The estimator (21) is approximately equal to a regression 
estimator and estimator (22) is also approximately equal to 
the same regression estimator. 


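Theorem: Let a sequence of populations and samples, 
$\{F_N, A_N\}$, satisfy 

$$(\bar{y}_{HT}, \bar{x}_{HT}) - (\bar{y}_N, \bar{x}_N) = O_p(n^{-1/2}). \tag{23}$$

Assume that the sequences of estimated covariance 
matrices, $\hat{\Sigma}_{\bar{x}\bar{x}}$ and $\hat{\Sigma}_{\bar{x}\bar{x}(i)}$, satisfy 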


[D2 ¥.- D7? |" 


= lp" pat p?}"= O, (n"), (24) 


where $D$ denotes a diagonal matrix having the elements of 
the diagonal of $\Sigma_{\bar{x}\bar{x}}$ on its diagonal. Let $d_i$ be a function 
of the sample satisfying (19) and assume (18) holds. 
Assume the sequence of Horvitz-Thompson variance 
estimators satisfies 


var{ n[vech(.. icatisane | = O(n"), (25) 


where z; =(X;, y;) and }/.. is positive definite. Assume 
Elf. } is bounded, where 7,;, is defined in (20). 
Then, the SCW- estimator y,; of (21) satisfies 


Vie Var Rye Xap) Op Op ne) 


= Yup + (Ky —Xyr) 8+ 0, (0), 


where @= =! M.. and 6, =D! M;; 

If Var{d?_,71;'}>0, assume x, contains one as an 
element. Assume M,,;=2.;. Then the weighted 
regression estimator of (22) satisfies 


Dae ms Yur +(Xy —Xyr) OF Ol Gis) 


For proof, see the appendix. 

To illustrate the nature of the different types of regression 
weights, we selected a simple random sample of size 40 
from a normal population with mean zero and variance one. 
The sample mean is −0.614 and the population mean is 
zero. The weights for the regression estimator are given by (2) 
with $\alpha_i = \phi_i^{-1} = n^{-1}$. The weights for the raking ratio and 
MLE are obtained by minimizing the objective functions (5) 
and (7), respectively, under the restriction (4). Weights for 
the SCW-weighted regression estimator are given in (22). 
The weights are plotted against the sample $x$ values in 
Figure 1. Five of the simple regression weights are less than 
zero because of the large discrepancy between the sample 
and the population means. All weights for the SCW-weighted 
regression estimator, MLE and raking ratio are 
nonnegative. Figure 1 shows that the behaviors of raking 
ratio and SCW-weighted regression weights are similar and 
that MLE has an extremely large weight in this sample. 
Table 1 contains selected weights for the smallest $x$ values, 
$x$ values close to the sample mean, $x$ values close to the 
population mean, and the largest $x$ values. For the $x$ values 
farthest from the population mean, MLE gives the 
largest weights. For $x$ values near the sample mean, the 
ordinary least squares weights are close to $n^{-1}$ while the 
other weights are less than $n^{-1}$. The MLE weights are close 
to $n^{-1}$ for $x$ values close to the population mean while the 
other weights are larger. 


Table 1 
Selected Regression Weights for Illustrated Example 
x Weights multiplied by n = 40 

Reg W. Reg Raking MLE 

= N08} — 0.56 0.12 0.16 0.40 
== 11h — 0.40 One; 0.20 0.40 
= R924) — 0.16 0.20 0.24 0.44 
— 0.710 0.88 0.68 0.68 0.68 
— 0.670 0.96 0.72 0.68 0.68 
— 0.468 1.16 0.88 0.84 0.76 
=) 103 1.52 1.28 1 a4 0.92 
0.021 1.68 1.44 1.40 1.00 
0.097 1.76 1.56 es) 1.08 
0.628 DEY) 2.60 2.60 1.84 
0.662 2.36 2.68 DL 1.92 
Wasi 2.96 4.60 4.88 One 

4. Simulation Study 


[Figure 1. Legend: Regression, W. Regression, Raking, MLE.] 

To compare the alternative methods of constructing 
regression weights we conducted a simulation study. A total 
of 30,000 simple random samples of size 32 were selected 
from a $\chi^2$ distribution with two degrees of freedom. The 
parameters being estimated are those of the infinite 
generating mechanism. Let $x_i$ be the value for the $i$th 
sampled element. Six estimation procedures were 
considered. 


1. Ordinary least squares regression (OLS) 

2. Quadratic programming with upper and lower bounds (QP) 

3. Weighted regression with SCW weights (SCW reg) 

4. Maximum likelihood objective function (MLE) 

5. Raking objective function (Raking reg) 

6. Logit procedure with upper and lower bounds (Logit) 


The weights for the OLS estimator were calculated by (2) 
with $\alpha_i = \phi_i^{-1} = n^{-1}$. The quadratic programming weights 
minimize $\sum_{i=1}^{n} w_i^2$ subject to the constraint $0 \le w_i \le 0.065$ 
for all $i$ and subject to constraints (4). The quadratic 
programming procedure is equivalent to the truncated linear 
method of case 7 of Deville and Särndal (1992). Weights for 
the SCW weighted regression were calculated by 
minimizing $\sum_{i=1}^{n} \alpha_i^{-1} w_i^2$ subject to constraints (4), where 
$\alpha_i$ is defined in (22). The weights for raking and maximum 
likelihood were obtained by minimizing the objective 
functions (5) and (7), respectively, under the restriction (4). 
Weights calculated by the logit procedure minimize the 
function $\sum_{i=1}^{n} G(nw_i)$ subject to constraints (4), where 

$$G(nw_i) = a^{-1} \left\{ (nw_i) \ln(nw_i) + (u - nw_i) \ln\left( \frac{u - nw_i}{u - 1} \right) \right\}$$

if $0 < nw_i < u$ and $\infty$ elsewhere, $a = u(u-1)^{-1}$ and 
$u = 2.08$. Note that the solution for the logit procedure, if it 
exists, satisfies the bound restrictions $0 < w_i < 0.065$ for all 
$i$. The logit procedure was introduced as case 6 in Deville 
and Särndal (1992). As the upper bound for the weight, 
0.065 was used so that 3,026 samples (approximately 10%) 
have at least one raking regression weight greater than 
0.065. In 99 samples among 30,000, no solution for the 
quadratic programming and logit procedure is possible 
because no feasible point satisfies (4) and the bound 
restriction. For those 99 samples, the maximum of the OLS 
regression weights was used as the upper bound for the 
quadratic programming and logit procedures. 

Figure 1. Comparison of four sets of weights. 
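
For a single auxiliary variable, the bounded least squares problem can be solved without a general quadratic programming routine. The Karush-Kuhn-Tucker conditions imply $w_i = \min\{\max(a + b x_i, 0), u^*\}$ for some scalars $a$ and $b$, where $u^*$ is the upper bound on the weights, so a nested bisection on $(a, b)$ is enough. The sketch below is an illustration under those assumptions, not the routine used in the simulation; the function names, data and bound are hypothetical.

```python
def _bisect(f, lo, hi, iters=200):
    """Root of a monotone f on [lo, hi] with a sign change."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bounded_ls_weights(x, xbar_pop, upper):
    """Minimize sum(w_i^2) s.t. sum(w)=1, sum(w*x)=xbar_pop, 0 <= w_i <= upper."""
    if len(x) * upper < 1.0:
        raise ValueError("upper bound too small for the weights to sum to one")
    xmax = max(abs(xi) for xi in x)

    def clip(v):
        return min(max(v, 0.0), upper)

    def weights(b):
        # For a fixed slope b, choose the intercept a so the weights sum to one.
        span = abs(b) * xmax
        total = lambda a: sum(clip(a + b * xi) for xi in x) - 1.0
        a = _bisect(total, -span - 1.0, upper + span + 1.0)
        return [clip(a + b * xi) for xi in x]

    def gap(b):
        # Calibration error for the mean of x; nondecreasing in b.
        return sum(wi * xi for wi, xi in zip(weights(b), x)) - xbar_pop

    lo, hi = -1.0, 1.0
    for _ in range(60):
        if gap(lo) <= 0.0 <= gap(hi):
            break
        lo *= 2.0
        hi *= 2.0
    else:
        raise ValueError("no feasible point satisfies the constraints and bounds")
    return weights(_bisect(gap, lo, hi))


x = [-2.0, -1.0, -0.5, 0.0, 1.0]
w_qp = bounded_ls_weights(x, xbar_pop=0.5, upper=1.0)
```

In this small example the unbounded least squares solution would put a negative weight on the smallest $x$ value; the bounded solution sets the two smallest weights to zero and spreads the rest linearly in $x$. When the bounds cannot be met the function raises an error, which corresponds to the infeasible samples mentioned in the text.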

Table 2 shows the average of the sum of squares for the 
six weights. The average weight is 1/32 = 0.03125 for every 
estimator. The least squares procedures have the smallest 
sum of squares of the weights because this is the objective 
function for those procedures. The least squares procedures 
also have a slightly smaller range in the sum of squares. One 
percent of the least squares samples have a normalized 
mean of squares greater than 1.401 while one percent of the 
mean of squares for raking are greater than 1.441. 


Table 2 
Monte Carlo Average of the Sum of Squares of the Weights 

                        OLS     QP      SCW Reg   MLE     Raking Reg   Logit 
Average of w'w (×32)    1.043   1.044   1.045     1.053   1.045        1.045 

Table 3 contains properties for the minimum of the 
weights. Maximum likelihood has the largest average 
minimum weight while the least squares procedures have a 
smaller average for the minimum weight. The variance of 
the minimum weight is largest for the ordinary least squares 
procedures. Note that QP permits weights that equal the 
lower bound of zero. 


Table 3 
Monte Carlo Mean, Variance and Quantiles 
of the Minimum Weight 
Quantiles (x32 ) 
Mean Variance 
Procedure (x102 ) (x105) 0.01 0.10 0.50 0.90 0.99 
OLS Deyo 6.46 —0.10 0.34 0.79 0.96 1.00 
QP 22 6.32 0.00 0.32 0.79 0.96 1.00 
SCW Reg 2.44 3.58 0.22 0.49 0.84 0.97 0.99 
MLE 2.45 2.79 033705202 0)83 20:97 00 
Raking Reg 2.36 3.81 0.20 0.45 081 0.97 1.00 
Logit 225 323 0.09 0.36 0.78 0.96 1.00 


Among the procedures without bound restrictions on the 
weights, the ordinary least squares procedure has smaller 
maximum weight on average and much smaller variance for 
the maximum. See Table 4. The SCW-weighted regression 
has a smaller fraction of very large weights than MLE or 
raking ratio but a higher fraction of large weights than the 
ordinary least squares procedure. The bounded QP and 
Logit procedures have smaller mean and variance for the 
maximum weight than the procedures with no upper bound 
restrictions. 

Table 4 


Monte Carlo Mean, Variance and Quantiles 
of the Maximum Weight 


Quantiles (x32 ) 
Mean Variance 
Procedure (x102 ) (x105) 0.01 0.10 0.50 0.90 0.99 
OLS 4.25 17.35 1.00) ~ 1.03) “20 1292295 
QP ANG) 11.91 1:00) 1.03 1220) 19252208 
SCW Reg 4.56 26.42 1:03) M1 O7 ee D7, esa 
MLE 4.75 56.13 OO AOD EDS esi hh ALI. 
Raking Reg 4.46 30.25 1:00) 1.03. 91235209 3165 
Logit 4.13 10.23 LOO MEMOS 2 eS 2m Os 


To evaluate the performance of the procedures when the 
linear model does not hold, we considered estimation of the 
percentiles of the distribution function of $x$. Table 5 
contains the Monte Carlo bias of the percentile estimators 
where the table entries are 

$$[\min\{P, (1-P)\}]^{-1} [E\{\hat{P}\} - P] \times 100,$$

and $P$ is the percentile. For example, the Monte Carlo 
estimated relative bias in the ordinary least squares 
estimator of the 0.01 percentile is −7.75%. The ordinary 
least squares estimator has the largest biases in estimating 
the population percentiles, among the procedures without 
bound restrictions. The MLE has the smallest bias for all 
percentiles except the 75th, 95th and 99th, where the 
SCW-weighted regression estimator has the smallest bias. 
For samples of size 32, many samples contain no 
observation greater than the 99th percentile. The QP and 
Logit procedures have larger bias than other procedures 
except for the 75th percentile. The biases of the QP and 
Logit procedures are relatively large for the lower 
percentiles. 
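
The standardized bias above can be computed from Monte Carlo output as in the following sketch. It assumes that $\hat{P}$ is the weighted proportion of sample values at or below the true population quantile, which the text does not spell out, and it uses equal weights in the demonstration; the function names and replication count are illustrative.

```python
import math
import random


def weighted_cdf(x, w, q):
    """Weighted estimate of the distribution function at q."""
    return sum(wi for xi, wi in zip(x, w) if xi <= q)


def standardized_bias(estimates, P):
    """[min{P, 1-P}]^{-1} [mean(estimates) - P] x 100."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - P) / min(P, 1.0 - P)


def chi2_2df_quantile(P):
    """Quantile of chi-square with 2 df, i.e. exponential with mean 2."""
    return -2.0 * math.log(1.0 - P)


rng = random.Random(0)
n, P = 32, 0.5
q = chi2_2df_quantile(P)
estimates = []
for _ in range(2000):
    sample = [rng.expovariate(0.5) for _ in range(n)]  # rate 0.5 gives mean 2
    estimates.append(weighted_cdf(sample, [1.0 / n] * n, q))
bias_equal_weights = standardized_bias(estimates, P)
```

With equal weights the estimator is unbiased, so `bias_equal_weights` should be near zero; replacing the equal weights with any of the calibrated weight vectors gives the kind of entry reported in Table 5.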


Table 5 
Monte Carlo Standardized Bias in Percentile Estimators 
Percentile Procedure 

OLS QP SCW Reg _ MLE Raking Reg Logit 
0.01 i SAS — De =: —4.70 -—8.30 
0.05 9H, = IS = SR) a Ss? 43) os 
0.10 = 6.60) =o leo S30 =I CHE 
0.25 = SSS S78) =1OS S38) — Sy, 
0.50 S30 a BAaG = Si Nag AIR SBS 
0.75 =2ZO P07 Sed), 277i = NOS, AIR 
0.90 4.60 5.31 2 022 Ole 5.68 
0.95 DATE. SV 38 6.01 6.41 9525 13.15 
0.99 32.94 32.36 19.03 22.66 26.65 30.03 


Table 6 contains the relative MSE of the percentile 
estimators where the table entries are 

$$[\min\{P, (1-P)\}]^{-2} [E\{(\hat{P} - P)^2\}] \times 100.$$

Thus the relative mean square error of the OLS estimator of 
the 0.01 percentile is 283.27%. Although the OLS estimator 
of the 0.01 percentile had the largest bias, OLS has the 
smallest mean square error for the 0.01 percentile among the 
procedures without bound restrictions. The QP, OLS and 
Logit procedures are superior for the extreme percentiles 
while the other procedures perform better for the middle 
percentiles. 
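
As a rough scale check on the relative MSE entries, the sketch below computes the same quantity for a list of estimates and, as a benchmark, for an unweighted sample proportion, whose variance is $P(1-P)/n$. The benchmark is an added reference point, not something reported in the paper, and the function names are hypothetical.

```python
def relative_mse(estimates, P):
    """[min{P, 1-P}]^{-2} mean((estimate - P)^2) x 100."""
    mse = sum((e - P) ** 2 for e in estimates) / len(estimates)
    return 100.0 * mse / min(P, 1.0 - P) ** 2


def equal_weight_benchmark(P, n):
    """Relative MSE of an unweighted sample proportion with variance P(1-P)/n."""
    return 100.0 * P * (1.0 - P) / (n * min(P, 1.0 - P) ** 2)


benchmark_001 = equal_weight_benchmark(0.01, 32)
```

For $P = 0.01$ and $n = 32$ the benchmark is 309.375, the same order of magnitude as the Table 6 entries for the 0.01 percentile.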


Table 6 
Monte Carlo Relative MSE of Percentile Estimators 
Percentile Procedure 


OLS QP SCW Reg _ MLE Raking Reg Logit 
0.01 283.27 282.50 309.23 311.58 296.37 282.76 


0.05 53.91 54.23 Si Ale 70) 54.97 54.06 
0.10 Paes) Pe OH) 26.40 25.79 25.26 25.80 
0.25 8.00 841 el eo 742 8Al 
0.50 |S aes A 1B Salas Pa YBa) 2.42 
0.75 3:01 = 73.08 3.621 535.06 BGS = 3.07. 
0.90 14.50 14.60 14.25. 14.57 14.36 14.56 
0.95 39.40 38.65 40.99 41.66 39.93 37.94 


0.99 ZOOS 7 19624. 5023571 216.22 205.85 194.33 


In 562 of 30,000 samples at least one of the OLS 
regression weights is negative. In 17 of the samples at least 
one of the original SCW regression weights was negative. 
The use of quadratic programming with the OLS objective 
function (QP) to produce weights greater than or equal to 
zero and less than 0.065 increases the average sum of 
squares by less than one percent. See Table 7. Using 
quadratic programming to bound the SCW regression 
weights (SCW (QPL)) by zero increases the average sum of 
squares very little because there are so few weights that are 
changed. 


Table 7 
Monte Carlo Average of the Sum of Squares of the Weights for 
Samples with at Least One Negative OLS Weight 


SCW SCW Raking 
OLS QP —Reg(QPL) MLE —Reg 
Average of Ww’ w (x32) M2082 2267 227842 242 


Table 8 gives the Monte Carlo MSE for the 562 samples 
with negative ordinary least squares weights. The quadratic 
programming procedure is superior to other nonnegative 
weight procedures for the 0.01 percentile and is inferior for 
the 0.99 percentile. Of the 562 samples, 497 had a sample 
mean greater than the population mean. Recall that the study 
population has an exponential distribution. Because the 
weight on the largest observation is zero in the 497 samples, 
there is a 100 percent error in the quadratic programming 
estimator of the 0.99 percentile for most of the 497 samples 
with a sample mean greater than the population mean. In 
sampling from a finite population the bound on the weights 
would be greater than or equal to $N^{-1}$ and the MSE of the 
quadratic programming procedure for the 0.99 percentile 
would be reduced. 

Quadratic programming is superior to the other calibrated 
procedures for the 0.01 percentile in samples with negative 
OLS weights. Raking regression and SCW-weighted 
regression are superior to MLE for the 0.01 and 0.05 
percentiles. This is because MLE often has the largest 
maximum weight. 

Table 8 


Monte Carlo Relative MSE of Percentile Estimators 
for Samples with at Least One Negative OLS Weight 


Percentile Procedure 

OLS QP SCW (QPL) MLE _ Raking Reg 
0.01 281-526 291 350.58 461.80 344.06 
0.05 76.04 70.58 75.80 88.71 72.50 
0.10 44.80 40.74 39.31 38.84 36.05 
0.25 20.24 19.14 14.72 9.91 12.56 
0.50 5.03 S31 3.65 2.26 3.35 
0.75 S025) 453 3.36 4.24 3.45 
0.90 PETE weePe sXoh8) 20.04 18.80 20.49 
0.95 51.54 46.04 30.79 28.28 32.54 
0.99 206.33 90.08 39.40 57.54 43.49 


In 3,026 of 30,000 samples, at least one of the raking 
regression weights is greater than 0.065. In 2,152 samples, 
at least one of the OLS regression weights is greater than 
0.065, and in 3,209 samples at least one of the SCW 
regression weights is greater than 0.065. The use of 
quadratic programming with the OLS objective function to 
produce weights in (0.000, 0.065) increases the average sum 
of squares by 1.5 percent. Using quadratic programming to 
bound the SCW regression weights by 0.000 and 0.065 
increases the average sum of squares 0.8 percent. See the 
column for SCW (QP) of Table 9. 


Table 9 
Monte Carlo Average of the Sum of Squares of the 
Weights for Samples with at Least One Raking 
Reg Weight Greater than 0.065 

                        OLS     QP      SCW Reg   SCW (QP)   Raking Reg   Logit   MLE 
Average of w'w (×32)    1.210   1.228   1.221     1.231      1.228        1.232   1.290 

Table 10 gives the Monte Carlo relative MSE for the 
3,026 samples with raking regression weights greater than 
0.065. Quadratic programming is superior to SCW (QP) 
and Logit for the 0.01, 0.95 and 0.99 percentiles and the 
Logit procedure is superior to quadratic programming for 
other percentiles. 


Table 10 
Monte Carlo Relative MSE of Percentile Estimators for Samples 
with at Least One Raking Reg Weight Greater than 0.065 


Percentile Procedure 
SCW SCW Raking 

OLS QP —Reg (QP) —Reg Logit MLE 
0.01 139.96 130.53 173.86 146.40 124.02 173.65 206.65 
0.05 39.83 42.88 39.35 41.69 39.87 37.14 40.83 
0.10 26.31 30.92 2240 28.10 28.88 20.21 19.98 
0.25 EYES. ies TIOMISY TIS) iva 8.65 7.01 
0.50 3.95 4.87 B32.) Wael) Dod 3.03 2.28 
0.75 4:84 5.35 489 5.58 Soil 5.05 5.48 
0.90 2 9S 29045 28:70 29848 29:32" 228.79) 32:07 
0.95 TANS OU-O4 885.02) OS12— 165.98 83:13, 95,99 


0.99 198.77 179.58 219.16 181.17 172.45 212.38 226.73 
5. Discussion 


We began the research with the conjecture that starting 
with the SCW weights in a regression estimator would 
produce weights that were almost always positive and that 
the weights would have desirable properties as measured by 
the ability to estimate the distribution function of x. To 
some extent these results support the conjectures. The 
minimum weights of the SCW regression are larger than 
those of OLS and comparable to those for raking. Quadratic 
programming can be used to remove the negative weights in 
the few samples with negative weights. If no upper bound is 
imposed, the maximum weights for the SCW weighted 
regression fall between those of least squares and raking. 

It is known that all of the procedures in our simulation 
study have the same order $n^{-1/2}$ properties. Our simulation 
and the study of generalized raking procedures done by 
Deville et al. (1993) indicate that there are also modest 
differences in small samples. No procedure is superior with 
respect to all criteria. Because of the poor performance for 
the extreme percentiles, we recommend against the use of 
the MLE objective function. The quadratic programming 
and Logit procedure produced weights with marginally 
smaller sums of squares, marginally smaller maximum 
weights, and marginally smaller MSE for extreme 
percentiles than the raking regression. The MLE, SCW 
regression and raking procedures give marginally larger 
minimum weights and marginally smaller MSE for the 
middle percentiles of the x distribution than quadratic 
programming and Logit procedure. The performances of 
quadratic programming and Logit procedures in estimating 
the distribution function of x are comparable. 


Appendix 


Proof. The ratio of the determinants of estimated covariance 
matrices in (20) is 


by assumptions (24) and (25). The difference Ge 
is 


ts (i 


hur Ey) (EF cy Gt ) ar —Ey) 
—2 (Kur —Xy) Los. @ A, +d, Lk wy 
By assumptions (23) and (24), 
exp{0.S{®ur —Xy) (Lyk w — Lak) Bur Ey) = 
1+0,(n™). (27) 
Using assumptions (24) and (19), the Taylor expansion at 
d, =0 gives 
exp[-—(Xp; ss go) oREY «yd, +0.5.d., py) @ d‘] 
=1+4(Xy Kur) Lek wy i, +O, (07) 
=1+(Xy —Xpy) Led’, +O, ('). (28) 
Thus, by (26), (27) and (28), 


[NT; 


ilar 


J = (Nn) "1+ ®y —Kyr) Lied, ]+ 0, (0). 


By assumptions (18), (23) and (25), and using the fact that 
Ei... } is bounded, 
Va = Yur t (Xy —Xpr) +0, (0) 
= Yur t(Ky —Xyr) Oy tO,(n"). 29) 
If one is an element of x, or Var{X?_,7,'}=0, and if 


M,; =2,;;, the SCW-estimator for the population mean of 


vector q; =(1, x;) satisfies 
eal Di hie O; = Cs Xa) OU eo 


because the 6 for x is the identity matrix. By (30), 


-l 
(X,, so=n|5 Rite (X pa Y vt) 
i=1 
= (5 90g) t+ OLED 
Thus, 


Vie = ae +(Xy = gall) bbe 


= Y oft +(Xy =x) By +(y, —Viadt (5 =x be 
=¥~7+O,(n~) 
= Vir (ky Xe OE 


by (30), (31) and (29). 
An Optimal Calibration Distance Leading to the Optimal 
Regression Estimator 


Per Gösta Andersson and Daniel Thorburn ¹ 


Abstract 


When there is auxiliary information in survey sampling, the design based “optimal (regression) estimator” of a finite 
population total/mean is known to be (at least asymptotically) more efficient than the corresponding GREG estimator. We 
will illustrate this by some simulations with stratified sampling from skewed populations. The GREG estimator was 
originally constructed using an assisting linear superpopulation model. It may also be seen as a calibration estimator; i.e., as 
a weighted linear estimator, where the weights obey the calibration equation and, with that restriction, are as close as 
possible to the original “Horvitz-Thompson weights” (according to a suitable distance). We show that the optimal estimator 
can also be seen as a calibration estimator in this respect, with a quadratic distance measure closely related to the one 
generating the GREG estimator. Simple examples will also be given, revealing that this new measure is not always easily obtained. 


Key Words: Horvitz-Thompson estimator; Regression estimator; Survey sampling theory. 


1. Notation and Basics 


Consider a finite population $U$ consisting of $N$ objects 
labelled $1, \dots, N$ with associated study values $y_1, \dots, y_N$ 
and $J$-dimensional auxiliary (column) vectors $x_1, \dots, x_N$. 
We want to estimate the population total $t_y = \sum_{i \in U} y_i$ 
by drawing a random sample $s$ of size $n$ (fixed or random) 
from $U$, with first and second order inclusion probabilities 
$\pi_i = P(i \in s)$, $\pi_{ij} = P(i, j \in s)$, $i, j = 1, 2, \dots, N$. The 
study values and the auxiliary vectors are recorded for the 
sampled objects and before the sample is drawn we assume 
that at least $t_x = \sum_{i \in U} x_i$ is known. 

This is the standard setup for a regression estimator. In 
section 2 we discuss different regression estimators: the 
common GREG estimator (Särndal, Swensson and 
Wretman 1992), the optimal estimator (Montanari 1987, 
Andersson, Nerman and Westhall 1995) and calibration 
estimators (Deville and Särndal 1992). It is well known that 
the GREG estimator can be obtained as a calibration 
estimator. In section 3 it is shown that this holds also for the 
optimal estimator, but with a more complicated distance 
measure. In the last two sections this and the optimal 
estimator are illustrated, first by theoretical examples and 
then by simulations. 

Finally some comments about matrix notation in this 
paper: Generally, the transpose of a matrix $A$ is denoted by 
$A'$ and if $A$ is square, the inverse (generalised inverse) is 
written $A^{-1}$ ($A^{-}$). We further let the column vectors 
$y = (y_i)_{i \in s}$ and $w_0 = (1/\pi_i)_{i \in s}$, let $X = (x_i)_{i \in s}$ be the $J \times n$ 
“design” matrix of the auxiliary information from $s$ and 
finally $I_n$ means a unit diagonal matrix of size $n$. 


2. Regression and Calibration Estimators 


An unbiased simple estimator of $t_y$ is the Horvitz-Thompson estimator $\hat t_y = \sum_{i \in s} y_i/\pi_i = y'w_0$. However, 
more efficient estimators may be obtained utilising the 
auxiliary information, e.g., the well-known model assisted 
GREG estimator, see Särndal et al. (1992). For example, 
when constructed from the assumption of a homoscedastic linear 
regression superpopulation model, the GREG estimator is 

$$\hat t_{yr} = y'w_0 + (y'R_sX')(XR_sX')^{-1}(t_x - \hat t_x) \tag{1}$$

$$= y'g, \tag{2}$$

where $R_s = w_0' I_n$ (the diagonal matrix with diagonal elements $1/\pi_i$), $\hat t_x = \sum_{i \in s} x_i/\pi_i$ and 

$$g = w_0 + R_sX'(XR_sX')^{-1}(t_x - \hat t_x).$$

Now, the expression (2) for the GREG estimator is 
interesting since we also have that 

$$Xg = t_x, \tag{3}$$

which is called the calibration equation. This brings us to an 
alternative possible derivation of the GREG estimator 
according to Deville and Särndal (1992). Suppose that we 
seek an estimator $y'w$ of $t_y$ with a vector $w$ of sample-dependent weights $(w_i)_{i \in s}$, which respects the corresponding calibration equation, while also minimising the 
distance between $w$ and $w_0$ according to the quadratic 
distance measure 


1. Per Gösta Andersson, Mathematical Statistics, Department of Mathematics, Linköping University, SE-581 83 Linköping, Sweden; Daniel Thorburn, 
Department of Statistics, Stockholm University, SE-106 91 Stockholm, Sweden. 
$$(w - w_0)' R (w - w_0),$$

where $R = (w_0' I_n)^{-1}$. 
This results in 

$$w = w_0 + R^{-1}X'(XR^{-1}X')^{-1}(t_x - \hat t_x), \tag{4}$$

which means that $w = g$, since here $R^{-1} = R_s$. 

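Turning to the optimal estimator, consider first the vector 
$(\hat t_y, \hat t_x')'$ and let $\Sigma_{yx}$ be the covariance (row) vector of $\hat t_y$ 
and $\hat t_x$ and $\Sigma_{xx}$ the covariance matrix of $\hat t_x$. Now, the 
minimum-variance, see Montanari (1987), unbiased linear 
estimator (in $\hat t_y$ and $\hat t_x$) of $t_y$ is the difference estimator 

$$\hat t_{ydif} = \hat t_y + \Sigma_{yx}\Sigma_{xx}^{-1}(t_x - \hat t_x). \tag{5}$$

Since $\Sigma_{yx}$ and $\Sigma_{xx}$ in practice are unknown, we let the 
optimal estimator be 

$$\hat t_{yopt} = y'w_0 + \hat\Sigma_{yx}\hat\Sigma_{xx}^{-1}(t_x - \hat t_x)$$

$$= y'w_0 + (y'R_{opt}X')(XR_{opt}X')^{-1}(t_x - \hat t_x), \tag{6}$$

where $R_{opt} = ((\pi_{ij} - \pi_i\pi_j)/(\pi_{ij}\pi_i\pi_j))_{i,j \in s}$. 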

In an asymptotic context, where $n \to \infty$ and 
$N \to \infty$, $\Sigma_{yx}$ and $\Sigma_{xx}$ may be viewed as components of 
the asymptotic covariance matrix of $(\hat t_y, \hat t_x')'$. Under the 
assumption of consistency of $\hat\Sigma_{yx}$ and $\hat\Sigma_{xx}$, which holds 
under very mild conditions, see Andersson et al. (1995), the 
optimal estimator has the same asymptotic variance as the 
difference estimator (5). In particular it follows that the 
optimal estimator is asymptotically better than the usual 
GREG estimator, see Rao (1994), Montanari (2000) and 
Andersson (2001), i.e., its asymptotic variance is never 
larger and usually smaller. In section 5 we actually present 
some simple simulations showing that the optimal estimator 
can be much more efficient than GREG. However, one does 
not know anything about the efficiency for finite samples, 
since the covariance estimator may converge slowly. The 
rate of convergence is illustrated in section 5. Note also that 
in some cases there exist asymptotically even better 
estimators which are not linear. 

Now, the fact that the GREG estimator is also a 
calibration estimator using 

$$(w - w_0)' R_s^{-1} (w - w_0) \tag{7}$$

as the distance measure and comparing (1) with (6), leads 
one to believe that replacing $R_s$ by $R_{opt}$ in (7) should 
imply that we instead derive the optimal regression 
estimator as a calibration estimator. That this actually holds 
is shown below. 
3. The Main Result 


In order to show existence of a distance measure 
corresponding to the optimal estimator, we will first state 
and prove a result in the general case. 


Lemma: With $R$ denoting an arbitrary positive definite 
$n \times n$ matrix, 

$$(w - w_0)' R (w - w_0) \tag{8}$$

subject to the constraint $Xw = t_x$, is minimised by 
$w = w_0 + R^{-1}X'(XR^{-1}X')^{-1}(t_x - \hat t_x)$. 


Proof: Introducing the $J \times 1$ vector $\lambda$ of Lagrange 
multipliers, we get after differentiation the equation system 

$$2R(w - w_0) + X'\lambda = 0 \tag{9}$$

$$Xw - t_x = 0 \tag{10}$$

Multiplying (9) by $XR^{-1}$, using (10) and solving for $\lambda$, 
yields with $Xw_0 = \hat t_x$: 

$$\lambda = -2(XR^{-1}X')^{-1}(t_x - \hat t_x). \tag{11}$$

Putting this into (9) and solving for $w$ finally leads to 
$w = w_0 + R^{-1}X'(XR^{-1}X')^{-1}(t_x - \hat t_x)$. 
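The closed form in the lemma is easy to check numerically. The following is a minimal sketch, assuming NumPy and entirely made-up data (the sample size, inclusion probabilities and totals are illustrative, not taken from the paper). It computes the calibrated weights for a given distance matrix and confirms that, with $R^{-1} = R_s$, they coincide with the GREG weights $g$.

```python
import numpy as np

def calibrate(w0, X, R, t_x):
    """Minimise (w - w0)' R (w - w0) subject to X w = t_x.

    w0  : (n,) starting weights, e.g. 1/pi_i
    X   : (J, n) design matrix of auxiliary values for the sample
    R   : (n, n) positive definite distance matrix
    t_x : (J,) known auxiliary totals
    """
    R_inv_Xt = np.linalg.solve(R, X.T)        # R^{-1} X'
    t_x_hat = X @ w0                          # Horvitz-Thompson estimate of t_x
    # (X R^{-1} X')^{-1} (t_x - t_x_hat), i.e. -lambda/2 in the proof
    adj = np.linalg.solve(X @ R_inv_Xt, t_x - t_x_hat)
    return w0 + R_inv_Xt @ adj

# Illustrative (hypothetical) sample: n = 20 units with pi_i = n/N.
rng = np.random.default_rng(0)
n, N = 20, 200
pi = np.full(n, n / N)
w0 = 1.0 / pi
X = np.vstack([np.ones(n), rng.gamma(2.0, 1.0, size=n)])
t_x = np.array([N, 2.0 * N])                  # made-up population totals

R = np.diag(pi)                               # R^{-1} = diag(w0) = R_s
w = calibrate(w0, X, R, t_x)

# GREG weights g computed directly from R_s
R_s = np.diag(w0)
g = w0 + R_s @ X.T @ np.linalg.solve(X @ R_s @ X.T, t_x - X @ w0)
```

Here `w` satisfies the calibration equation and agrees with `g` up to rounding error.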
From the lemma we thus have the following main result: 


Theorem: With $R_{opt}$ being positive (semi-)definite and 
using the optimal calibration distance measure, which we 
get by letting $R = R_{opt}^{-1}$ ($R_{opt}^{-}$) in (8), the calibration 
estimator will become the optimal regression estimator. 


Remark: $R_{opt}$ may in some cases be indefinite (see below). 
The only thing we know is that it is an unbiased estimator of 
a covariance matrix. If it is not positive semi-definite there 
also exist $x$-values such that $XR_{opt}X'$ is not positive 
semi-definite, but the probability of such $x$-values goes to 
zero as the population and sample sizes increase (and if 
$\Sigma_{xx}$ is positive definite). A strict minimisation of a 
distance with “a negative component” would lead to 
infinitely large corrections. This problem of the optimal 
estimator has, to our knowledge, not been pointed out 
previously. 

The simplest way to find a distance which gives the 
optimal estimator as a calibration estimator is to find a 
matrix which has the same eigenvectors as $R_{opt}$ but 
where the eigenvalues are replaced by their absolute values. 
(This result can be shown along the same lines as the proof 
of the lemma above. The distance can be seen as the sum of 
the products of the eigenvalues and the squared 
eigenvectors. Putting the derivatives equal to zero means 
that in the lemma we found the extremes, i.e., the 
minima for positive eigenvalues and the maxima for 
negative eigenvalues. By changing all negative signs the 
extremes will all be minima.) 


4. Examples 


Positive definite $R_{opt}$: Suppose that the objects in $U$ are 
independently drawn with inclusion probabilities 
$\pi_1, \dots, \pi_N$ (Poisson sampling); thus implying a random 
sample size $n$, where $E(n) = \sum_{i=1}^{N} \pi_i$. Due to the 
independence of drawings, $R_{opt}$ is diagonal and specifically 

$$R_{opt} = \mathrm{diag}\left(\frac{1 - \pi_i}{\pi_i^2}\right)_{i \in s}.$$


Positive semi-definite $R_{opt}$: Suppose $n$ objects are drawn 
according to simple random sampling, i.e., each object has 
inclusion probability $\pi_i = n/N$. The elements of $R_{opt}$ are 

$$r_{ii} = \left(\frac{N}{n}\right)^2\left(1 - \frac{n}{N}\right), \qquad r_{ij} = -\left(\frac{N}{n}\right)^2\frac{N - n}{N(n - 1)}, \quad i \neq j.$$

This means that $R_{opt}$ is singular with rank $n - 1$. 

Suppose instead (as in the following simulation study) 
that $U$ is partitioned into $L$ strata of sizes $N_1, \dots, N_L$, 
from which we draw independent simple random samples 
of sizes $n_1, \dots, n_L$. The elements of $R_{opt}$ then are 

$$r_{ii} = \left(\frac{N_h}{n_h}\right)^2\left(1 - \frac{n_h}{N_h}\right), \qquad r_{ij} = -\left(\frac{N_h}{n_h}\right)^2\frac{N_h - n_h}{N_h(n_h - 1)}, \quad i \neq j,$$

when in the latter case $i$ and $j$ both belong to stratum 
$h$, $h = 1, \dots, L$ and 0 otherwise. Therefore $R_{opt}$ has rank 
$n - L$. 
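The rank statement for stratified simple random sampling can be checked numerically. The sketch below assumes NumPy and uses hypothetical stratum sizes. It builds the block-diagonal $R_{opt}$ from the general definition $(\pi_{ij} - \pi_i\pi_j)/(\pi_{ij}\pi_i\pi_j)$ specialised to simple random sampling within strata, and then inspects its rank and eigenvalues.

```python
import numpy as np

def r_opt_stratified_srs(N_h, n_h):
    """Block-diagonal R_opt for independent SRS within strata.

    Within stratum h: pi_i = n_h/N_h and pi_ij = n_h(n_h-1)/(N_h(N_h-1)),
    which gives the diagonal and off-diagonal elements used below.
    Elements for units in different strata are zero.
    """
    n = sum(n_h)
    R = np.zeros((n, n))
    start = 0
    for N, m in zip(N_h, n_h):
        scale = (N / m) ** 2
        off = -scale * (N - m) / (N * (m - 1))
        block = np.full((m, m), off)
        np.fill_diagonal(block, scale * (1 - m / N))
        R[start:start + m, start:start + m] = block
        start += m
    return R

# Hypothetical design: L = 3 strata, n = 12 sampled units in total.
N_h = [40, 25, 20]
n_h = [4, 3, 5]
R = r_opt_stratified_srs(N_h, n_h)
rank = np.linalg.matrix_rank(R, tol=1e-8)
eigenvalues = np.linalg.eigvalsh(R)
```

In this illustration `rank` equals the total sample size minus the number of strata, and no eigenvalue is negative beyond rounding error.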
Non positive semi-definite $R_{opt}$: Let $U$ consist of four 
elements and $s$ of two elements. Suppose that a systematic 
sample is taken with probability 0.94 and a simple random 
sample with probability 0.06, i.e., $\pi_{13} = \pi_{24} = 0.48$ and 
$\pi_{12} = \pi_{14} = \pi_{23} = \pi_{34} = 0.01$. In that case 

$$R_{opt} = \begin{pmatrix} 2 & 23/12 \\ 23/12 & 2 \end{pmatrix} \tag{12}$$

with probability 0.96 and 

$$R_{opt} = \begin{pmatrix} 2 & -96 \\ -96 & 2 \end{pmatrix}$$

with probability 0.04. The second matrix has a negative 
eigenvalue. 
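The two matrices in this example follow directly from the definition of $R_{opt}$. As a minimal sketch, assuming NumPy and zero-based unit labels (so units 1 to 4 become indices 0 to 3), one can compute them and their eigenvalues as follows.

```python
import numpy as np

def r_opt(pi, pi2, sample):
    """R_opt = ((pi_ij - pi_i pi_j) / (pi_ij pi_i pi_j)) for i, j in the sample."""
    s = np.asarray(sample)
    p = pi[s]
    pij = pi2[np.ix_(s, s)]
    pp = np.outer(p, p)
    return (pij - pp) / (pij * pp)

# Mixture design: systematic sampling w.p. 0.94, simple random sampling w.p. 0.06.
pi = np.full(4, 0.5)                     # first order inclusion probabilities
pi2 = np.full((4, 4), 0.01)              # pairs reachable only through SRS
pi2[0, 2] = pi2[2, 0] = 0.48             # pair (1, 3)
pi2[1, 3] = pi2[3, 1] = 0.48             # pair (2, 4)
np.fill_diagonal(pi2, 0.5)               # pi_ii = pi_i

R_systematic = r_opt(pi, pi2, [0, 2])    # sample {1, 3}
R_other = r_opt(pi, pi2, [0, 1])         # sample {1, 2}

eig_systematic = np.linalg.eigvalsh(R_systematic)
eig_other = np.linalg.eigvalsh(R_other)
```

The eigenvalues of the second matrix are $2 \pm 96$, so one of them is negative, whereas both eigenvalues of the first matrix are positive.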

The problem does not necessarily disappear if $N$ is 
large. Consider instead a population consisting of $N/4$ 
strata with four elements each. Suppose that the above 
sampling procedure is used independently in each stratum. 
In that case $R_{opt}$ will consist of a matrix with the above 
$2 \times 2$ matrices along the diagonal and zeroes elsewhere. 


5. A Simulation Study 
5.1 Notation and Outline 


In order to make empirical comparisons between the 
optimal estimator (OPT) and the GREG estimator (GREG) 
and also compare these estimators with the Horvitz- 
Thompson estimator (HT), we have conducted a small 
simulation study. In the previous sections we mentioned that 
OPT is Best Linear Asymptotic Efficient and a calibration 
estimator. Even though it has many nice properties it may 
for reasonable sample sizes be inefficient. Here we will in 
some simulated situations show that the optimal estimator 
can be a substantial improvement compared to GREG also 
for moderate sample sizes when the population is 
(deliberately) chosen to be unfavourable for GREG. A 
simple but non-trivial situation for which OPT is not equal 
to GREG arises for stratified simple random sampling, in 
particular, when the slopes differ between the different 
strata and the unstratified population. Consider therefore a 
population of size $N$, which is partitioned into $L$ strata of 
sizes $N_1, \dots, N_L$. From each stratum $h$ a simple random 
sample $s_h$ of size $n_h$ is drawn, where $s_1 \cup \dots \cup s_L = s$ and 
$n_1 + \dots + n_L = n$. For simplicity we further assume that the 
auxiliary information is one-dimensional and global, i.e., 
only $t_x$ is known beforehand. For GREG we have chosen 
the homoscedastic simple linear regression model, see 
Särndal et al. (1992). 

The resulting expressions for HT, OPT and GREG 
respectively are 


where $\bar y_{st} = (1/N)\sum_{h=1}^{L} N_h \bar y_{s_h}$ ($\bar x_{st}$ 
analogous) and 
It is easily seen from these formulae that the optimal 
regression coefficient is the mean of the within stratum 
slopes and that the GREG regression coefficient is the 
global slope. When there is a large difference between these 
slopes the GREG correction performs poorly. We are here 
particularly interested in comparing the qualities of these 
estimators when the assisting (linear) model for GREG fails. 
We have thus generated $x$- and $y$-values from 
correlated lognormally distributed random variables $X$ and 
$Y$, where $\ln X$ is normally distributed with expectation 0 
and variance $\sigma_1^2$ ($N(0, \sigma_1^2)$) and $\ln Y$ is $N(0, \sigma_2^2)$. The 
variances $\sigma_1^2$ and $\sigma_2^2$ and the correlation between $\ln X$ and 
$\ln Y$ can then be chosen to obtain prespecified values of the 
variances $\sigma_x^2$ of $X$ and $\sigma_y^2$ of $Y$ and their correlation 
$\rho(X, Y)$. Values generated from bivariate normal 
distributions were obtained by MATLAB (version 6.0). 
Twelve populations have in this manner been created, each 
of size $N = 10{,}000$, including four combinations of 
variances $\sigma_x^2$ and $\sigma_y^2$ (10 and 100) and three values of the 
correlation $\rho(X, Y)$ (0.5, 0.7 and 0.9). For these 
populations a variance of 10 implies a skewness of 9.37 and 
the variance 100 leads to skewness 38.59. 
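The log-scale parameters implied by a target variance and correlation can be obtained from standard lognormal moment formulas. The following sketch uses only the Python standard library; the function names are ours and are not part of the original study, which used MATLAB. It reproduces the two skewness values quoted above.

```python
import math

def lognormal_sigma2(var_x):
    """Variance of ln X (with E ln X = 0) such that Var(X) = var_x.

    With u = exp(sigma^2), Var(X) = u(u - 1), so u solves u^2 - u - var_x = 0.
    """
    u = (1.0 + math.sqrt(1.0 + 4.0 * var_x)) / 2.0
    return math.log(u)

def lognormal_skewness(sigma2):
    """Skewness of a lognormal variable: (u + 2) sqrt(u - 1), u = exp(sigma^2)."""
    u = math.exp(sigma2)
    return (u + 2.0) * math.sqrt(u - 1.0)

def log_scale_correlation(rho_xy, s2_x, s2_y):
    """Correlation of (ln X, ln Y) giving Corr(X, Y) = rho_xy.

    Uses Corr(X, Y) = (exp(r s_x s_y) - 1) / sqrt((exp(s2_x) - 1)(exp(s2_y) - 1)).
    """
    a = math.sqrt((math.exp(s2_x) - 1.0) * (math.exp(s2_y) - 1.0))
    return math.log(1.0 + rho_xy * a) / math.sqrt(s2_x * s2_y)

s2_10 = lognormal_sigma2(10.0)
s2_100 = lognormal_sigma2(100.0)
skew_10 = lognormal_skewness(s2_10)      # about 9.37
skew_100 = lognormal_skewness(s2_100)    # about 38.59
rho_log = log_scale_correlation(0.9, s2_10, s2_100)
```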

Now, before stratification, the objects of each population 
are ordered with respect to ascending $y$-values. The 
number of strata is $L = 5$ throughout with sizes 
$N_1 = 4{,}000$, $N_2 = 2{,}500$, $N_3 = 2{,}000$, $N_4 = 1{,}000$ and 
$N_5 = 500$. These strata are constructed in such a way that 
objects with the smallest $y$-values constitute stratum 1, 
and so forth. From each stratified population we have drawn 
samples of sizes $n = 250$, 1,000 and 2,500, where for each 
sample $n_1 = \dots = n_5$. This means that we have created an 
approximate $\pi$ps (probability proportional to size) design, 
with for example, objects in stratum 5 having the largest 
inclusion probability ($n_5/N_5$). The number of simulated 
samples was $K = 25{,}000$ for each of the $12 \times 3 = 36$ cases 
and HT, OPT and GREG were then computed for each 
sample. 
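For a single stratified sample, the three estimators can be computed from the matrix forms (1) and (6) rather than from stratum-specific formulae. The sketch below assumes NumPy and a small made-up population (far smaller than the populations in the study). The helper names are ours, and GREG is computed with an intercept and $x$ as auxiliaries, which is our reading of the homoscedastic simple linear regression model.

```python
import numpy as np

def regression_type_estimate(y, X, w0, R, t_x):
    """y'w0 + (y'RX')(XRX')^{-1}(t_x - X w0), for symmetric R.

    R = R_s gives the GREG form (1); R = R_opt gives the optimal form (6).
    """
    coef = np.linalg.solve(X @ R @ X.T, X @ R @ y)
    return y @ w0 + coef @ (t_x - X @ w0)

def r_opt_stratified_srs(N_h, n_h):
    """Block-diagonal R_opt for independent SRS within strata."""
    R = np.zeros((sum(n_h), sum(n_h)))
    start = 0
    for N, m in zip(N_h, n_h):
        scale = (N / m) ** 2
        block = np.full((m, m), -scale * (N - m) / (N * (m - 1)))
        np.fill_diagonal(block, scale * (1 - m / N))
        R[start:start + m, start:start + m] = block
        start += m
    return R

# Hypothetical population: 5 strata, equal allocation.
rng = np.random.default_rng(1)
N_h = [400, 250, 200, 100, 50]
n_h = [10, 10, 10, 10, 10]
x_pop = [rng.lognormal(mean=0.3 * h, sigma=0.5, size=N) for h, N in enumerate(N_h)]
y_pop = [2.0 * x + rng.normal(scale=1.0, size=x.size) for x in x_pop]

idx = [rng.choice(N, size=m, replace=False) for N, m in zip(N_h, n_h)]
x = np.concatenate([xp[i] for xp, i in zip(x_pop, idx)])
y = np.concatenate([yp[i] for yp, i in zip(y_pop, idx)])
w0 = np.concatenate([np.full(m, N / m) for N, m in zip(N_h, n_h)])

N_total = float(sum(N_h))
t_x = float(sum(xp.sum() for xp in x_pop))
n = x.size

ht = y @ w0
greg = regression_type_estimate(
    y, np.vstack([np.ones(n), x]), w0, np.diag(w0), np.array([N_total, t_x]))
opt = regression_type_estimate(
    y, x[None, :], w0, r_opt_stratified_srs(N_h, n_h), np.array([t_x]))
```

Repeating the sampling step many times and comparing the empirical variances of `ht`, `greg` and `opt` would give a small-scale analogue of the study described above.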
In general, common measures of quality for an estimator 
$\hat t$ of a total $t$ from a sequence $\hat t_1, \dots, \hat t_K$ are the estimated 
relative bias $(\bar{\hat t} - t)/t$ and the estimated variance 
$S^2_{\hat t} = \frac{1}{K - 1}\sum_{k=1}^{K}(\hat t_k - \bar{\hat t})^2$, 
where $\bar{\hat t} = (1/K)\sum_{k=1}^{K}\hat t_k$. 

Since we are mainly concerned with comparisons of 
OPT and GREG, we will only display results of the relative 
measures of variance (or equivalently standard deviation) 

$$\frac{S^2_{OPT}}{S^2_{HT}} \quad \text{and} \quad \frac{S^2_{GREG}}{S^2_{HT}},$$

from which we can compare the estimated variances of OPT 
and GREG with HT and also determine which of OPT and 
GREG has the lowest estimated variance. 


5.2 Results 


Firstly, as reference, the absolute value of the estimated 
relative bias of the unbiased HT did not in any case exceed 
4-107. The corresponding maximum values for OPT and 
GREG were 6-10~°, which means that we may concentrate 
on the ratios of estimated variances in order to evaluate 
relative efficiencies of HT, OPT and GREG. 

As seen from Table 1, OPT is superior to both HT and 
GREG (with one exception: $\rho(X, Y) = 0.9$, $\sigma_x^2 = 10$, 
$\sigma_y^2 = 100$ and $n = 250$, where GREG has slightly less 
estimated variance). For the lowest correlation though, the 
decrease in estimated variance for OPT compared with HT 
is not substantial. GREG on the other hand does not 
compete well with the others and this anomaly is 
particularly accentuated for the largest sample size 
$n = 2{,}500$. Changing $\rho(X, Y)$ to 0.7 means improvement 
for both OPT and GREG, but GREG is also now for most 
cases inferior to HT. Finally, for $\rho(X, Y) = 0.9$ GREG still 
displays poor behavior compared with HT for $n = 2{,}500$ 
(with the exception of $\sigma_x^2 = 100$ and $\sigma_y^2 = 10$). In general 
GREG is closing in on OPT for increasing values of 
$\rho(X, Y)$ (the assisting linear model becoming less 
misspecified), while OPT, on the other hand, is increasing 
its superiority over GREG for increasing sample sizes, 
which should come as no surprise since OPT is 
asymptotically well motivated. 
Table 1 
Relative Estimated Efficiencies (Given as Percentages) of OPT ($S^2_{OPT}/S^2_{HT}$) and GREG ($S^2_{GREG}/S^2_{HT}$) to HT, 
Based on 25,000 Simulated Samples for Each Sample Size 


                  σ_x² = 10         σ_x² = 10         σ_x² = 100        σ_x² = 100 
                  σ_y² = 10         σ_y² = 100        σ_y² = 10         σ_y² = 100 
                  OPT     GREG      OPT     GREG      OPT     GREG      OPT     GREG 
ρ(X, Y) = 0.5 
  n = 250         99.1    232.8     97.4    176.8     93.9    179.4     91.4    20 
  n = 1,000       98.3    247.1     98.0    193.7     97.5    183.5     99.9    141.9 
  n = 2,500       96.8    756.7     96.8    1,455.0   97.8    534.7     96.8    1,623.5 
ρ(X, Y) = 0.7 
  n = 250         89.7    197.6     83.8    101.2     73.6    120.4     64.3    72.9 
  n = 1,000       91.0    221.5     89.8    111.2     81.2    120.5     Td      84.0 
  n = 2,500       93.8    648.2     91.5    1,308.6   93.1    218.6     93.1    613.5 
ρ(X, Y) = 0.9 
  n = 250         56.5    76.1      41.2    38.8      ple     43.4      40.4    41.4 
  n = 1,000       61.8    87.3      44.1    44.2      27.6    44.1      41.5    45.4 
  n = 2,500       71.0    237.4     59.8    335.4     63.6    66.0      74.6    259.8 
Approximations to $b^*$ in the Prediction of Design Effects 
Due to Clustering 


Peter Lynn and Siegfried Gabler ' 


Abstract 


Kish’s well-known expression for the design effect due to clustering is often used to inform sample design, using an 
approximation such as $\bar b$ in place of $b$. If the design involves either weighting or variation in cluster sample sizes, this can 
be a poor approximation. In this article we discuss the sensitivity of the approximation to departures from the implicit 
assumptions and propose an alternative approximation. 


Key Words: Complex sample design; Intracluster correlation coefficient; Selection probabilities; Weighting. 


1. Alternative Functions of Cluster Size 


Kish (1965) used an expression for the design effect 
(variance inflation factor) due to sample clustering, 
$\mathrm{deff} = 1 + (b - 1)\rho$, where $b$ is the number of observations 
in each cluster (primary sampling unit) and $\rho$ is the 
intracluster correlation coefficient. This expression is well-known, is taught in courses on sampling theory, and is used 
by survey practitioners in designing and evaluating samples. 

The expression holds when there is no variation in cluster 
sample size and the design is equal-probability (self-weighting). We can express these two criteria formally: 

$$b_c = b \quad \forall c, \tag{1}$$

where $c = 1, \dots, C$ denote the clusters, and 

$$w_i = w \quad \forall i, \tag{2}$$

where $i = 1, \dots, I$ denote the weighting classes, with $w_i$ the 
associated design weights. 

However, most surveys involve departures from (1) and 
(2). In the general case, i.e., removing restrictions (1) and 
(2), Gabler, Häder and Lahiri (1999) showed that under an 
appropriate model, $\mathrm{deff} = 1 + (b^* - 1)\rho$, where 

$$b^* = \frac{\sum_{c=1}^{C}\left(\sum_{i=1}^{I} w_i b_{ci}\right)^2}{\sum_{i=1}^{I} w_i^2 b_i} = \frac{\sum_{c=1}^{C}\left(\sum_{j=1}^{b_c} w_{cj}\right)^2}{\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}^2} \tag{3}$$

and $b_{ci}$ is the number of observations in weighting class $i$ in 
cluster $c$, $b_i = \sum_{c=1}^{C} b_{ci}$ (we have changed the notation from 
that of Gabler et al. (1999), to provide consistency) and $w_{cj}$ 
is the weight associated with the $j$th observation in cluster 
$c$, $j = 1, \dots, b_c$. 

The quantity $b^*$ can be calculated from survey microdata, provided the design weight and cluster membership are 
known for each observation. However, at the sample design 
stage it is not clear how $b^*$ can be predicted. Gabler et al. 
(1999) interpreted Kish’s $b$ as a form of weighted average 
cluster size: 

$$\bar b_w = \frac{\sum_{c=1}^{C} b_c\left(\sum_{j=1}^{b_c} w_{cj}^2\right)}{\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}^2}, \tag{4}$$

where $b_c$ is the number of observations in cluster 
$c$, $b_c = \sum_{i=1}^{I} b_{ci}$. However, (4) is no easier than (3) to 
predict at the sample design stage. A simpler interpretation, 
perhaps commonly used in sample design, is the 
unweighted mean cluster size: 

$$\bar b = \sum_{c=1}^{C} b_c / C = m/C. \tag{5}$$

It is much easier to predict $\bar b$ at the sample design stage 
than either $\bar b_w$ or $b^*$, as it requires knowledge only of the 
total number of observations, $m$, and total number of 
clusters, $C$. 
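Given microdata with a design weight and a cluster identifier per observation, the three cluster-size measures can be computed in a few lines. This is a minimal sketch, assuming NumPy. The function name is ours, and the weighted average uses our reading of (4) as the sum over clusters of $b_c \sum_j w_{cj}^2$ divided by the sum of all squared weights.

```python
import numpy as np

def cluster_size_measures(weights, clusters):
    """Return (b_star, b_w, b_bar) from per-observation weights and cluster ids.

    b_star : sum_c (sum_j w_cj)^2 / sum_c sum_j w_cj^2, as in (3)
    b_w    : sum_c b_c (sum_j w_cj^2) / sum_c sum_j w_cj^2, our reading of (4)
    b_bar  : unweighted mean cluster size m / C, as in (5)
    """
    w = np.asarray(weights, dtype=float)
    _, idx = np.unique(np.asarray(clusters), return_inverse=True)
    idx = idx.ravel()
    b_c = np.bincount(idx)                      # cluster sample sizes
    sum_w = np.bincount(idx, weights=w)         # sum of weights per cluster
    sum_w2 = np.bincount(idx, weights=w ** 2)   # sum of squared weights per cluster
    b_star = (sum_w ** 2).sum() / sum_w2.sum()
    b_w = (b_c * sum_w2).sum() / sum_w2.sum()
    b_bar = b_c.mean()
    return b_star, b_w, b_bar

# Hypothetical microdata: 3 clusters of unequal size with varying weights.
clusters = np.array(["a"] * 3 + ["b"] * 5 + ["c"] * 8)
weights = np.array([1, 2, 1, 3, 1, 1, 2, 2, 1, 1, 4, 1, 2, 1, 1, 3], dtype=float)
b_star, b_w, b_bar = cluster_size_measures(weights, clusters)
```

Under restrictions (1) and (2) all three values coincide with the common cluster size $b$.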


2. Relationship Between $b^*$, $\bar b_w$ and $\bar b$ 
Under Alternative Assumptions 


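Let, with $\bar w_c = \frac{1}{b_c}\sum_{j=1}^{b_c} w_{cj}$ denoting the mean weight in cluster $c$, 

$$\mathrm{Cov}(b_c, b_c\bar w_c^2) = \frac{1}{C}\sum_{c=1}^{C} b_c^2\bar w_c^2 - \frac{1}{C^2}\sum_{c=1}^{C} b_c \sum_{c=1}^{C} b_c\bar w_c^2,$$

and 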
So, in that circumstance, $b^* \le \bar b_w$. If, additionally, weights 
are equal within clusters, viz: 

$$w_{cj} = w_c \quad \forall j \in c, \tag{8}$$

then $b^* = \bar b_w$. 
If (8) holds, but not (1), then 

$$b^* \ge \bar b \quad \text{if and only if} \quad \mathrm{Cov}(b_c, b_c w_c^2) \ge 0,$$

since 

$$b^* - \bar b = \frac{C \cdot \mathrm{Cov}(b_c, b_c w_c^2)}{\sum_{c=1}^{C} b_c w_c^2}.$$

The covariance would be negative only if small cluster 
sizes coincide with large average weights within the clusters 
and vice versa. In section 4 below, we observe that this did 
not occur in any country on round 1 of the European Social 
Survey. Furthermore, from (3) and (4), we have: 

$$b^* = \bar b_w = \frac{\sum_{c=1}^{C} b_c^2 w_c^2}{\sum_{c=1}^{C} w_c^2 b_c}. \tag{9}$$

If we additionally impose the restriction (1), then we 
have the obvious result $b^* = \bar b_w = \bar b = b_c\ \forall c$. 

The result in (9) would apply to surveys where the only 
variation in selection probabilities was due to dispropor- 
tionate sampling between domains that did not cross-cut 
clusters. A common example would involve dispropor- 
tionate stratification by region, with PSUs consisting of 
geographical areas hierarchical to regions. 

A practical relaxation of the restriction on the variation in 
weights is: 

$$b_{ci} = b_c\left(\frac{b_i}{m}\right) \quad \forall i, c. \tag{10}$$

In other words, we allow variation in weights within 
clusters, but we constrain the weights to have the same 
relative frequency distribution in each cluster, i.e., the 
means and the variances of the weights within clusters do 
not depend on the clusters. 
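Now, (3) simplifies as follows: 

$$b^* = \frac{\sum_{c=1}^{C}\left(\sum_{i=1}^{I} w_i b_c b_i/m\right)^2}{\sum_{i=1}^{I} w_i^2 b_i} = \frac{\sum_{c=1}^{C} b_c^2}{m^2}\cdot\frac{\left(\sum_{i=1}^{I} w_i b_i\right)^2}{\sum_{i=1}^{I} w_i^2 b_i}. \tag{11}$$

Note that $\left(\sum_{i=1}^{I} w_i b_i\right)^2 / \sum_{i=1}^{I} w_i^2 b_i = m/(1 + c_w^2)$, 
where $c_w^2$ is the squared coefficient of variation, across all 
observations, of the weights. Also, $\left(\sum_{c=1}^{C} b_c^2\right)/m^2 = (1 + c_b^2)/C$, where $c_b^2$ is the squared coefficient of 
variation, across all clusters, of the cluster sample sizes. 
Thus, (11) becomes: 

$$b^* = \frac{m}{C}\cdot\frac{1 + c_b^2}{1 + c_w^2} = \bar b\,\frac{1 + c_b^2}{1 + c_w^2} = \tilde b, \text{ say.} \tag{12}$$

So, $\bar b$ will underestimate $b^*$ if $c_b^2 > c_w^2$ and vice versa. 
In particular, if $w_{cj} = w\ \forall j, c$ and $c_b > 0$, then $b^* > \bar b$. The 
greater the variation in $b_c$, the greater the extent to which 
$\bar b$ will under-estimate $b^*$. 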

Assumption (10) will rarely hold exactly, but this result 
might be useful in situations where the distribution of 
weights is expected to be similar across clusters. An 
example might be address-based samples where one person 
is selected per address. If the distribution of the number of 
persons per address is approximately constant across PSUs 
(in the population), then the distribution of weights will vary 
across clusters in the sample only due to sampling variation 
and disproportionate nonresponse (the effect of this could, 
of course, be substantial if cluster sample sizes are small). 
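A small worked example may help to see how (12) behaves when (10) holds exactly. The sketch below assumes NumPy and a made-up sample of three clusters in which the weights 1 and 2 occur equally often in every cluster. In that case the approximation $\tilde b$ from (12) reproduces $b^*$ from (3), while the unweighted mean cluster size does not.

```python
import numpy as np

def deff(b, rho):
    """Design effect due to clustering, 1 + (b - 1) rho."""
    return 1.0 + (b - 1.0) * rho

def b_tilde(cluster_sizes, weights):
    """Approximation (12): mean cluster size times (1 + c_b^2) / (1 + c_w^2)."""
    b = np.asarray(cluster_sizes, dtype=float)
    w = np.asarray(weights, dtype=float)
    cb2 = b.var() / b.mean() ** 2     # squared CV of cluster sizes
    cw2 = w.var() / w.mean() ** 2     # squared CV of weights over all observations
    return b.mean() * (1.0 + cb2) / (1.0 + cw2)

# Hypothetical sample: clusters of sizes 2, 4 and 6, weights alternating 1, 2.
clusters = np.repeat([0, 1, 2], [2, 4, 6])
weights = np.tile([1.0, 2.0], 6)
sizes = np.bincount(clusters)

b_bar = sizes.mean()
b_star = (np.bincount(clusters, weights=weights) ** 2).sum() / (weights ** 2).sum()
b_approx = b_tilde(sizes, weights)

rho = 0.05                            # illustrative intracluster correlation
deff_star, deff_bar = deff(b_star, rho), deff(b_bar, rho)
```

Here $c_b^2 = 1/6$ exceeds $c_w^2 = 1/9$, so the mean cluster size (4) understates $b^*$ (4.2), and the predicted design effect based on it is correspondingly too small.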

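If no restriction is imposed on the variation in weights, 
but $\mathrm{Var}(w_{cj}) > 0$ for at least one $c$, then, from (6), 

$$b^* \ge \bar b \iff \zeta = \frac{C \cdot \mathrm{Cov}(b_c, b_c\bar w_c^2)}{\bar b\sum_{c=1}^{C} b_c \mathrm{Var}(w_{cj})} \ge 1. \tag{13}$$

If (10) holds, then $\zeta = c_b^2/c_w^2$. 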


3. Implications for Sample Design 


Expression (12) suggests that $b^*$ may be predicted by 
predicting the relative magnitudes of $c_b^2$ and $c_w^2$. However, 
this result applies to a special situation, where 
When this covariance is expected to be small, it may be 
appropriate to predict $b^*$ thus: 


(14) 


Both coefficients of variation can be estimated from 
knowledge of the proposed sample design. In the following 
section, we investigate sensitivity of predictions obtained in 
this way to assumption (10) using real data from different 
sample designs with Cov(w.,, b.) >0. 
4. Example: European Social Survey 


The European Social Survey (ESS) is a cross-national 
survey for which great efforts have been made to achieve 
approximate functional equivalence in sample design 
between participating nations (Lynn, Häder, Gabler and 
Laaksonen 2004). Nevertheless, there is considerable variety in the types of design used, primarily due to variation in 
the nature of available frames and in local objectives, such 
as a desire for sub-national analysis which may lead to 
disproportionate stratification by domain. We use here data 
from the first round of the ESS, for which fieldwork was 
carried out in 2002-2003. Of the 22 participating nations, 
17 had a clustered sample design. Of these, two had not yet 
provided useable sample data at the time of writing. In 
Table 1 we present the sample values of $b^*$, $\bar b$, $c_b^2$, $c_w^2$, 
$\tilde b$, $|b^* - \tilde b|$, $|b^* - \bar b|$, $\mathrm{Corr}(w_{cj}, b_c)$, and $\zeta$ for the 
remaining 15. Note that the United Kingdom and Poland 
both had a 2-domain design with the sample clustered only 
in one domain, namely Great Britain (i.e., excluding 
Northern Ireland) and less densely-populated areas (i.e., all 
except the largest 42 towns) respectively. Figures presented 
in Table 1 relate only to the clustered domain. 


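Table 1 
Sample Values of $b^*$, $\bar b$, $c_b^2$, $c_w^2$, $\tilde b$, $|b^* - \tilde b|$, $|b^* - \bar b|$, $\mathrm{Corr}(w_{cj}, b_c)$, and $\zeta$, for 15 Surveys 


Country $b^*$ $\bar b$ $c_b^2$ $c_w^2$ $\tilde b$ $|b^* - \tilde b|$ $|b^* - \bar b|$ $\mathrm{Corr}(w_{cj}, b_c)$ $\zeta$ 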
Austria AT 6.49 7.08 0.08 0.25 6.15 0.34 0.58 0.0036 0.4549 
Belgium BE 6.56 SOF BOA 0.00 6.56 0.00 0.77 
Switzerland CH 8.83 O23 ee JOR 8.50 0.34 0.40 00273 0.7060 
Czech Republic CZ 2.94 201 WROD. 025 2.68 0.26 0.24 0.0225 1.7350 
Germany ye | a hs Ke adh Uh a im 7 i i A 2 1.43 0.72 —0.2287 
Spain ES 4.96 5:044e00.47 “1022-6 4.80 0.15 0.08 —0.0767 0.8757 
Great Britain Cpe Pirie tet <O.0cmer) 22 10.90 0.21 1.16 0.0114 0.4198 
Greece GR 5.47 DOO mmw09 a D220 Py 5.25 0.22 0.39 —0.0280 0.5207 
Hungary HU 8.68 8.18 0.06 0.00 8.68 0.00 0.50 
Ireland IE 1098 tS Os 3 Gas0 OAs 1205 0.05 Oy 0.0006 3.1054 
Israél IL b17O 5612.82 O12 70627 6927. Zo 1.02 —0.1271 0.4401 
Italy IT 1O98* “10:37-— 0267 - 0466, 11.80 0.83 0.10 —0.5589 1.3018 
Norway NO yp 4409 piel 8.687 etl 33 9:0 0Ol<- 43.32 0.77 25.41 0.0807 
Poland (rural) PE 10.07 9.45 0.06 0.01 9.88 0.19 0.62 0.2923 
Slovenia SI L076" O18. 70.06 © 2000" 10:76 0.00 0.63 
From (12), we would expect to observe $\bar b > b^*$ when 
$c_w^2 > c_b^2$. A common sample design for which this 
inequality can be anticipated is one where, a) the selected 
cluster sample size is constant, so variation in $b_c$ will be 
limited to that caused by differential non-response; and b) 
the samples are equal-probability samples of addresses, with 
subsequent random selection of one person per address, 
leading to variation in design weights reflecting the 
variation in household size. There are six nations with 
sample designs of this type (AT, CH, ES, GB, GR, IL). It is 
indeed the case that for all of these nations, $\zeta < 1$ and 
$\bar b > b^*$. Furthermore, for 5 of these 6 nations (AT, CH, ES, 
GB, GR) we might expect (10) to be a 
reasonable approximation as the only variation in weights is 
that due to selection within a household/address. For these, 
we might expect $\tilde b$ to perform better than $\bar b$. Indeed, 
the mean value of $|b^* - \tilde b|$ across these five nations is smaller 
than that of $|b^* - \bar b|$ (0.48). The one nation where $\tilde b$ would not 
provide an improvement is Spain and this is to be expected 
as $\bar b$ is small. Small cluster sample sizes leave them 
relatively more susceptible to the effects of nonresponse and 
also sampling variance, which will lead to violation of (10). 
In Israel, there was a further source of variation in design 
weights as there was disproportionate stratification by 
geographical areas. This too causes violation of (10), so we 
would not expect $\tilde b$ necessarily to provide an improvement 
on $\bar b$ as a predictor of $b^*$. 

Of the nations where $c_b^2 < c_w^2$, there is only one (CZ) for 
which $\bar b < b^*$ and $\zeta > 1$. This is also the nation with the 
smallest value of $\bar b$. When cluster sample sizes are 
particularly small, deff will be small and the choice between 
estimators of $b^*$ may be less important. 

There are five nations where sample units were 
individuals selected with equal probabilities (within 
clusters) from population registers (BE, DE, HU, PL, SI). In 
this case (8) (and, therefore, (10)) holds strictly, so we have 
$\bar b < b^*$. For three of these nations (BE, HU, SI) the sample 
is equal-probability, so we observe $\tilde b = b^*$. It is clear that $\tilde b$ 
is superior to $\bar b$ for equal-probability samples. For 
Germany and Poland, there is some variation in design 
weights between clusters (but not within). This variation is 
"|, but the same is 
not true in Germany, where the ex-East Germany was 
sampled at a considerably higher rate than the ex-West 
Germany. 

The Norwegian sample design was the only one that 
resulted in considerable variation in cluster sample sizes at 
the selection stage. The dramatic impact of this on $b^* - \bar b$ 
can clearly be seen. Again, this is a situation in which $\tilde b$ is 
likely to be preferable to $\bar b$ as a predictor of $b^*$. 

The designs in Ireland and Italy both involved selecting 
addresses from the electoral registers with probability 


Statistics Canada, Catalogue No. 12-001-XPB 


proportional to number of electors and then selecting one 
resident at random from each selected address. Such designs 
are not equal-probability, but are likely to result in 
considerably less variation in design weights than the 
address-based sample designs discussed earlier (Lynn and 
Pisati 2005). In both these cases, $c_w^2 < c_b^2$, the difference 
being greater in the case of Italy where some cluster sample 
sizes (in the largest municipalities) were considerably larger 
than the others (in Ireland, all were equal at the selection 
stage). Aside from the Czech Republic, these are the only 
two nations with $\zeta > 1$. 


5. Conclusion 


To aid prediction of the design effect due to clustering, 
we believe that $\tilde b$ is likely to be a better choice than $\bar b$ as a 
predictor of $b^*$ in situations where it can reasonably be 
expected that (10) will approximately hold. This includes, 
but is not restricted to, the following common types of 
sample design: 


—  Equal-probability designs where cluster sample sizes 
vary by design; 

— Equal-probability designs where clusters do not vary 
by design but are likely to vary due to nonresponse; 


— Address-based samples where one person is selected 
at each address, there is no other significant source of 
variation in selection probabilities, and cluster sizes 
do not vary by design. 
A Note on the $C_p$ Statistic Under the Nested Error Regression Model 


Jane L. Meza and P. Lahiri ' 


Abstract 


Nested error regression models are frequently used in small-area estimation and related problems. Standard regression 
model selection criteria, when applied to nested error regression models, may result in inefficient model selection methods. 
We illustrate this point by examining the performance of the $C_p$ statistic through a Monte Carlo simulation study. The 
inefficiency of the $C_p$ statistic may, however, be rectified by a suitable transformation of the data. 


Key Words: $C_p$ statistic; Nested error regression model; Monte Carlo simulation. 


1. Introduction 


This paper examines the limitations of a standard 
regression model selection criterion, the $C_p$ statistic, for 
nested error regression models. The $C_p$ statistic (Mallows 
1973) is defined by 

$$C_p = \frac{RSS_p}{\hat\sigma^2} - n + 2p \tag{1}$$

where $RSS_p$ is the residual sum of squares and $p$ is the 
number of parameters for model $P$, $n$ is the number of 
observations and $\hat\sigma^2$ is an estimate of $\sigma^2$. If the model is 
correct, the value of $C_p$ should be similar to or smaller than 
$p$. The $C_p$ model selection criterion is sensitive to outliers 
and departures from the normal i.i.d. assumption on the 
errors. The $C_p$ statistic therefore cannot be directly applied 
to the nested error regression model since here the error 
structure is not i.i.d. 
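For ordinary regression with i.i.d. errors, definition (1) translates directly into code. The sketch below assumes NumPy and simulated data. The function name is ours, and taking $\hat\sigma^2$ from the largest candidate model is a common convention that we assume here rather than something stated in the text.

```python
import numpy as np

def mallows_cp(y, X_sub, sigma2_hat):
    """C_p = RSS_p / sigma2_hat - n + 2p for the candidate design matrix X_sub.

    p is taken to be the number of columns of X_sub (intercept included).
    """
    n, p = X_sub.shape
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss = float(np.sum((y - X_sub @ beta) ** 2))
    return rss / sigma2_hat - n + 2 * p

# Simulated i.i.d. example: only the first covariate matters.
rng = np.random.default_rng(0)
n = 50
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Error variance estimated from the full model (assumed convention).
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sigma2_hat = np.sum((y - X_full @ beta_full) ** 2) / (n - X_full.shape[1])

cp_full = mallows_cp(y, X_full, sigma2_hat)
cp_sub = mallows_cp(y, X_full[:, :2], sigma2_hat)
```

With this choice of $\hat\sigma^2$ the full model always has $C_p$ equal to its number of parameters, so the informative comparisons are those among the smaller candidate models.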

We propose a transformation that adjusts for intracluster 
correlation and allows use of the standard $C_p$ model 
selection criterion. The method presented in this paper can 
be applied to select covariates in the analysis of complex 
survey data and small-area models. For example, our 
technique could be used to select covariates in the nested 
error regression model used by Battese, Harter and Fuller 
(1988) to estimate the area planted (in hectares) with corn 
or soybeans for twelve Iowa counties. They used the 
following model: 

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \tag{2}$$

for unit $j = 1, \dots, n_i$ in county $i = 1, \dots, m$, where $n_i$ is the 
sample size for small area $i$ and the total sample size is 
$n = \sum_{i=1}^{m} n_i$. The county effects, $v_i$, are distributed as 
$N(0, \sigma_v^2)$ independent of the random errors $e_{ij}$, which are 
distributed as $N(0, \sigma_e^2)$. The area (in hectares) in unit $j$ of 
county $i$ is denoted by $y_{ij}$ and $x_{ij} = (1, x_{ij1}, \dots, x_{ijp})'$ is a 
$p + 1$ vector of the values of the covariates $x_1, \dots, x_p$ for 
unit $j$ in county $i$. The vector $\beta = (\beta_0, \beta_1, \dots, \beta_p)'$ is a 
$p + 1$ vector of unknown parameters. 

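The nested error regression model can be expressed in
matrix form as

$$y = X\beta + \varepsilon \qquad (3)$$

where $y = (y_1', \ldots, y_m')'$, $y_i = (y_{i1}, \ldots, y_{in_i})'$,
$\varepsilon = (\varepsilon_1', \ldots, \varepsilon_m')'$, $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{in_i})'$ and
$\varepsilon_{ij} = v_i + e_{ij}$. Further, $X = (X_1', \ldots, X_m')'$,
where $X_i$ is an $n_i \times (p+1)$ matrix with rows $x_{ij}'$ for
$j = 1, \ldots, n_i$, and $\varepsilon \sim N(0, \sigma^2 V)$, where $\sigma^2 = \sigma_v^2 + \sigma_e^2$. The matrix $V$ has
block-diagonal form $\oplus_{i=1}^{m} V_i$ with $V_i = (1-\rho) I_{n_i} + \rho J_{n_i}$,
where $\rho = \sigma_v^2/\sigma^2$ is the common intrastratum correlation,
$I_{n_i}$ is the $n_i \times n_i$ identity matrix and $J_{n_i}$ is the $n_i \times n_i$ unit
matrix.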

Since the nested error model does not have i.i.d. errors,
standard regression procedures do not apply. The simulation
study in Section 3 reveals that the $C_p$ criterion does not
perform well under the nested error regression model. The
transformations considered in the next section are used to
transform the nested error regression model into a standard
regression model with i.i.d. errors. With these transformed
observations, the $C_p$ criterion performs much better.


2. Adjusting for Intra-area Correlations 


As noted in the previous section, conventional model
selection methods like the $C_p$ criterion are not appropriate
because they ignore the intrastratum correlations. Wu, Holt
and Holmes (1988) and Rao, Sutradhar and Yue (1993)
studied the effect of conventional methods for the nested
error regression model in a different context.

1. Jane L. Meza, University of Nebraska Medical Center, 984350 Nebraska Medical Center, Omaha, NE 68198-4350. E-mail: jmeza@unmc.edu; P. Lahiri,
University of Maryland at College Park, 1218 Le Frak Hall, College Park, MD 20742-8241. E-mail: Plahiri@survey.umd.edu.

Consider the nested error regression model and let
$\sigma^2 = \sigma_v^2 + \sigma_e^2$ and $\rho$ be the common intra-area correlation,
$\rho = \sigma_v^2/\sigma^2$. As in Fuller and Battese (1973) and Rao et al.
(1993), transform the nested error regression model into a
standard regression model with i.i.d. errors. Let

$$\alpha_i = 1 - \left[\frac{1-\rho}{1+(n_i-1)\rho}\right]^{1/2} \qquad (4)$$

$$y_{ij}^* = y_{ij} - \alpha_i \bar{y}_i \qquad (5)$$

$$x_{ij}^* = x_{ij} - \alpha_i \bar{x}_i \qquad (6)$$

where $\bar{y}_i = \sum_{j=1}^{n_i} y_{ij}/n_i$ and $\bar{x}_i = \sum_{j=1}^{n_i} x_{ij}/n_i$. The transformed
model then becomes

$$y_{ij}^* = x_{ij}^{*\prime}\beta + e_{ij}^* \qquad (7)$$

for $j = 1, \ldots, n_i$, $i = 1, \ldots, m$, and the $e_{ij}^*$ are independently
distributed as $N(0, \sigma_e^2)$. Now, the standard $C_p$ model
selection criterion may be applied to the transformed data.
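The transformation can be sketched for a single area as follows, assuming it takes the Fuller and Battese form $\alpha_i = 1 - [(1-\rho)/(1+(n_i-1)\rho)]^{1/2}$ with each observation shifted by $\alpha_i$ times its area mean. The function names `alpha` and `transform_area` are hypothetical, and $\rho$ is treated as known.

```python
from math import sqrt


def alpha(n_i, rho):
    """Shrinkage factor for an area with n_i sampled units (0 <= rho < 1)."""
    return 1.0 - sqrt((1.0 - rho) / (1.0 + (n_i - 1) * rho))


def transform_area(y, x_rows, rho):
    """Return (y_star, x_star) for one area.

    y      : list of the n_i responses in the area
    x_rows : list of n_i covariate vectors (with the leading 1 when the
             model has an intercept)
    """
    n_i = len(y)
    a = alpha(n_i, rho)
    y_bar = sum(y) / n_i
    x_bar = [sum(col) / n_i for col in zip(*x_rows)]  # column means
    y_star = [y_ij - a * y_bar for y_ij in y]
    x_star = [[x - a * xb for x, xb in zip(row, x_bar)] for row in x_rows]
    return y_star, x_star


# Illustrative values only: one area with two units.
y_star, x_star = transform_area([1.0, 3.0], [[1.0, 2.0], [1.0, 4.0]], rho=0.5)
```

Note that the intercept column is transformed too: a leading 1 becomes $1 - \alpha_i$, which varies across areas when the $n_i$ differ. When $\rho = 0$ the factor $\alpha_i$ is zero and the data are left unchanged.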

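In practice, $\rho$ is usually unknown and must be estimated
from the data. Rao et al. (1993) used Henderson’s (1953)
method to obtain unbiased quadratic estimators $\hat{\sigma}_v^2$ and $\hat{\sigma}_e^2$
of the variance components $\sigma_v^2$ and $\sigma_e^2$. Once the
estimators have been obtained, $\rho = \sigma_v^2/(\sigma_v^2 + \sigma_e^2)$ may be
estimated by

$$\hat{\rho} = \frac{\hat{\sigma}_v^2}{\hat{\sigma}_v^2 + \hat{\sigma}_e^2}. \qquad (8)$$

To obtain the estimators of the variance components, let
$\{\hat{u}_{ij}\}$ be the residuals from the ordinary least squares
regression of $\{y_{ij} - \bar{y}_i\}$ on $\{x_{ij1} - \bar{x}_{i1}, \ldots, x_{ijp} - \bar{x}_{ip}\}$ without
the intercept term, where $\bar{x}_{il} = \sum_{j=1}^{n_i} x_{ijl}/n_i$ for
$l = 1, \ldots, p$. Let $\{\hat{r}_{ij}\}$ be the residuals from the ordinary
least squares regression of $y_{ij}$ on $\{x_{ij1}, \ldots, x_{ijp}\}$ with the
intercept term.

The estimators of $\sigma_e^2$ and $\sigma_v^2$ are given by

$$\hat{\sigma}_e^2 = (n - m - p - 1 + \lambda)^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \hat{u}_{ij}^2 \qquad (9)$$

$$\hat{\sigma}_v^2 = n_*^{-1} \left[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} \hat{r}_{ij}^2 - (n - p - 1)\hat{\sigma}_e^2 \right] \qquad (10)$$

$$n_* = n - \mathrm{tr}\left[ (X'X)^{-1} \sum_{i=1}^{m} n_i^2 \bar{x}_i \bar{x}_i' \right] \qquad (11)$$

where $\lambda = 0$ if the model has no intercept term and $\lambda = 1$
otherwise. We propose to apply the standard $C_p$ model
selection criterion to the transformed observations $y_{ij}^*$
and $x_{ij}^*$.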
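The arithmetic of the variance component estimators can be sketched as below, assuming equations (9) to (11) take the Henderson-type form $\hat{\sigma}_e^2 = SS_u/(n-m-p-1+\lambda)$ and $\hat{\sigma}_v^2 = [SS_r - (n-p-1)\hat{\sigma}_e^2]/n_*$, where $SS_u$ and $SS_r$ are the two residual sums of squares. The function names are hypothetical, the inputs are taken as already computed, and truncating a negative $\hat{\sigma}_v^2$ at zero is an added assumption that the text does not discuss.

```python
def variance_components(ss_u, ss_r, n, m, p, n_star, has_intercept=True):
    """Return (sigma2_e_hat, sigma2_v_hat).

    ss_u   : sum of squared residuals from the within-area (deviation) fit
    ss_r   : sum of squared residuals from the ordinary fit with intercept
    n, m   : total sample size and number of areas
    p      : number of covariates, excluding the intercept
    n_star : n - tr[(X'X)^{-1} sum_i n_i^2 xbar_i xbar_i'], computed elsewhere
    """
    lam = 1 if has_intercept else 0
    sigma2_e = ss_u / (n - m - p - 1 + lam)
    sigma2_v = (ss_r - (n - p - 1) * sigma2_e) / n_star
    return sigma2_e, sigma2_v


def rho_hat(sigma2_v, sigma2_e):
    """Estimated intra-area correlation, equation (8).

    Assumption (not from the text): a negative estimate of sigma2_v is
    truncated at zero so that the result stays in [0, 1).
    """
    s2v = max(sigma2_v, 0.0)
    return s2v / (s2v + sigma2_e)


# Made-up sums of squares, for illustration only.
s2e, s2v = variance_components(ss_u=30.0, ss_r=100.0, n=40, m=10, p=4, n_star=32.0)
rho_estimate = rho_hat(s2v, s2e)
```

The resulting estimate would then replace $\rho$ in the transformation of equations (4) to (6).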


3. A Simulation Study 


A simulation study was conducted to examine the
behavior of the $C_p$ model selection criterion and the
proposed transformations for the nested error regression
model. The following model was considered:

$$y_{ij} = \beta_0 x_{ij0} + \beta_1 x_{ij1} + \beta_2 x_{ij2} + \beta_3 x_{ij3} + \beta_4 x_{ij4} + v_i + e_{ij} \qquad (12)$$

for $i = 1, \ldots, 10$, $n_i \in$ {252.55}, $\sigma_e^2 = 1$ and $n = 40$.
The $v_i$ are distributed as $N(0, \sigma_v^2)$ independent of the $e_{ij}$,
which are distributed as $N(0, 1)$. The data $x_{ijk}$ are taken
from an example given by Gunst and Mason (1980) and
included in Shao (1993) (Table 1). The value of $x_{ij0}$ is 1 for
all $i = 1, \ldots, 10$, $j = 1, \ldots, n_i$.
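One way to generate responses from model (12) for a single area is sketched below. The function name `simulate_area` is hypothetical, the two covariate rows are the first two rows of Table 1 with a leading 1 for $x_{ij0}$, and grouping those two rows into one area is an assumption made purely for illustration.

```python
import random
from math import sqrt


def simulate_area(x_rows, beta, sigma2_v, rng, sigma2_e=1.0):
    """Draw y_ij = x_ij' beta + v_i + e_ij for every unit j in one area."""
    # A single area effect v_i is shared by all units in the area.
    v_i = rng.gauss(0.0, sqrt(sigma2_v))
    y = []
    for row in x_rows:
        mean = sum(b * x for b, x in zip(beta, row))
        # Unit-level errors e_ij are drawn independently for each unit.
        y.append(mean + v_i + rng.gauss(0.0, sqrt(sigma2_e)))
    return y


rng = random.Random(2005)  # fixed seed so the sketch is reproducible
area_rows = [
    [1.0, 0.36, 0.53, 1.06, 0.5326],
    [1.0, 1.32, 2.52, 5.74, 3.6183],
]
y_area = simulate_area(area_rows, beta=(2, 0, 0, 4, 0), sigma2_v=16, rng=rng)
```

Because `random.gauss` takes a standard deviation, the variances $\sigma_v^2$ and $\sigma_e^2$ are converted with a square root before sampling.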


Table 1 
Data for Nested Error Simulation 
x1     x2     x3     x4     x1     x2     x3     x4
0.3600 0.5300 1.0600 0.5326 0.0900 0.1800 0.5900 0.1855 
1.3200 2.5200 5.7400 3.6183 0.0200 0.1600 0.2400 0.1572 
0.0600 0.0900 0.2700 0.2594 0.0200 0.1100 0.2100 0.0998 
0.1600 0.4100 0.8300 1.0346 0.0500 0.2400 0.4300 0.2804 
0.0100 0.0200 0.0700 0.0381 0.1100 0.3900 0.2900 0.2879 
0.0200 0.0700 0.0700 0.3440 0.1800 0.1100 0.4300 0.6810 
0.5600 0.6200 2.1200 1.4559 0.0400 0.0900 0.2300 0.3242 
0.9800 1.0600 2.8900 4.0182 0.8500 1.3300 2.7000 2.6013 
0.3200 0.2000 0.7600 0.4600 0.1700 0.3200 0.6600 0.4469 
0.0100 0.0000 0.0700 0.1540 0.0800 0.1200 0.4900 0.2436 
0.1500 0.2500 0.5000 0.6516 0.3800 0.1800 0.4900 0.4400 
0.2400 0.2800 0.5900 0.0611 0.1100 0.1300 0.1800 0.3351 
0.1100 0.3500 0.4000 0.1922 0.3900 0.3800 0.9900 1.3979 
0.0800 0.1300 0.2800 0.0931 0.4300 0.4600 1.4700 2.0138 
0.6100 0.8500 0.4900 0.0538 0.5700 1.1600 1.8200 1.9356 
0.0300 0.0300 0.2300 0.0199 0.1300 0.0300 0.0800 0.1050 
0.0600 0.1100 0.5000 0.0419 0.0400 0.0500 0.1400 0.2207 
0.0200 0.0800 0.2500 0.1093 0.1300 0.1800 0.2800 0.0180 
0.0400 0.2400 0.0800 0.0328 0.2000 0.9500 0.4100 0.1017 
0.0000 0.0200 0.0400 0.0797 0.0700 0.0600 0.1800 0.0962 
Some of the $\beta_k$ may be zero and thus various
combinations of variables were chosen from
$(x_0, x_1, x_2, x_3, x_4)$ to be the predictors used to generate data
coming from a nested error regression model. There are
$2^5 - 1 = 31$ possible models. Each model will be denoted
by a subset of (0, 1, 2, 3, 4) that contains the indices of the
variables $x_k$ in the model.

Data were generated using 1,000 simulations for several
values of $\sigma_v^2$ to estimate the probability of selecting each
model using the $C_p$ criterion. The value of $\sigma_e^2$ was taken to
be 1 for all simulations. The results of the simulation are
given in Table 2. The values of $\sigma_v^2$ considered were 0, 1, 2,
5, 10 and 16 and the values of $\beta'$ were taken to be (2, 0, 0,
4, 0), (2, 0, 0, 4, 8), (2, 9, 0, 4, 8) and (2, 9, 6, 4, 8) as in
Shao (1993). Models were categorized as optimal, category
II (correct but not optimal), or category I (incorrect). A correct
model contains every variable with a nonzero coefficient; the
optimal model contains those variables and no others.
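The model space and the three categories can be enumerated with a short sketch. The function names `all_models` and `categorize` are hypothetical, and the rule used here (a model is correct when it contains every variable with a nonzero coefficient) is the reading of "correct" assumed for this illustration.

```python
from itertools import combinations


def all_models(k=5):
    """All non-empty subsets of the indices 0, ..., k-1 (2**k - 1 of them)."""
    indices = range(k)
    return [subset
            for size in range(1, k + 1)
            for subset in combinations(indices, size)]


def categorize(model, beta):
    """Classify a candidate model against the true coefficient vector."""
    needed = {i for i, b in enumerate(beta) if b != 0}
    chosen = set(model)
    if chosen == needed:
        return "optimal"
    if needed <= chosen:
        return "II"   # correct but not optimal: extra variables included
    return "I"        # incorrect: at least one needed variable is missing


models = all_models()
labels = {m: categorize(m, (2, 0, 0, 4, 0)) for m in models}
```

For $\beta' = (2, 0, 0, 4, 0)$ this labels (0, 3) as optimal, supersets such as (0, 1, 3) as category II, and any model omitting index 0 or 3 as category I.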

The $C_p$ criterion did not perform well for large values of
$\sigma_v^2$. For the model $\beta' = (2, 0, 0, 4, 0)$ with $\sigma_v^2 = 1$ the
estimated selection probabilities were: optimal model, 0.54;
correct model, 0.46; incorrect model, 0. In contrast, when
$\sigma_v^2 = 16$, the estimated selection probabilities were: optimal
model, 0.43; correct model, 0.35; incorrect model, 0.22.
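The category totals quoted for $\sigma_v^2 = 16$ can be checked by summing the corresponding column of Table 2 for $\beta' = (2, 0, 0, 4, 0)$. The sketch below keeps only each row's category and estimated probability; the variable names are illustrative.

```python
# (category, estimated selection probability) for each listed model,
# taken from the sigma_v^2 = 16 column of Table 2, beta' = (2, 0, 0, 4, 0).
column_16 = [
    ("optimal", 0.43),
    ("II", 0.06), ("II", 0.12), ("II", 0.10), ("II", 0.04),
    ("II", 0.01), ("II", 0.02), ("II", 0.00),
    ("I", 0.09), ("I", 0.05), ("I", 0.04),
    ("I", 0.01), ("I", 0.03), ("I", 0.00),
]

totals = {}
for category, prob in column_16:
    totals[category] = totals.get(category, 0.0) + prob

# Round to two decimals, the precision reported in the table.
totals = {category: round(value, 2) for category, value in totals.items()}
```

The rounded totals match the 0.43, 0.35 and 0.22 reported for the optimal, correct and incorrect models respectively.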

Table 2
Probabilities of Model Selection Before Transformation

$\beta = (2, 0, 0, 4, 0)'$
                          $\sigma_v^2$
Model           Category  0      1      2      5      10     16
0, 3            Optimal   0.62   0.54   0.49   0.46   0.45   0.43
0, 2, 3         II        0.11   0.09   0.09   0.10   0.07   0.06
0, 1, 3         II        0.09   0.14   0.19   0.17   0.15   0.12
0, 3, 4         II        0.09   0.13   0.13   0.14   0.11   0.10
0, 1, 2, 3      II        0.03   0.05   0.06   0.05   0.04   0.04
0, 1, 3, 4      II        0.02   0.03   0.02   0.02   0.02   0.01
0, 2, 3, 4      II        0.02   0.01   0.02   0.02   0.01   0.02
0, 1, 2, 3, 4   II        0.02   0.01   0.00   0.00   0.01   0.00
0, 1            I         0.00   0.00   0.00   0.01   0.07   0.09
0, 2            I         0.00   0.00   0.00   0.01   0.03   0.05
0, 4            I         0.00   0.00   0.00   0.00   0.01   0.04
0, 1, 2         I         0.00   0.00   0.00   0.01   0.01   0.01
0, 1, 4         I         0.00   0.00   0.00   0.01   0.02   0.03
0, 1, 2, 4      I         0.00   0.00   0.00   0.00   0.00   0.00

$\beta = (2, 0, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  0      1      2      5      10     16
0, 3, 4         Optimal   0.72   0.67   0.63   0.61   0.58   0.49
0, 2, 3, 4      II        0.12   0.12   0.14   0.14   0.11   0.09
0, 1, 3, 4      II        0.12   0.16   0.18   0.14   0.12   0.11
0, 1, 2, 3, 4   II        0.04   0.05   0.05   0.05   0.04   0.04
0, 4            I         0.00   0.00   0.00   0.00   0.01   0.06
0, 1, 4         I         0.00   0.00   0.00   0.02   0.05   0.10
0, 2, 4         I         0.00   0.00   0.00   0.03   0.07   0.10
0, 1, 2, 4      I         0.00   0.00   0.00   0.00   0.01   0.01

$\beta = (2, 9, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  0      1      2      5      10     16
0, 1, 3, 4      Optimal   0.83   0.78   0.75   0.63   0.39   0.25
0, 1, 2, 3, 4   II        0.17   0.20   0.18   0.13   0.09   0.07
0, 3, 4         I         0.00   0.01   0.03   0.13   0.29   0.35
0, 1, 4         I         0.00   0.00   0.00   0.03   0.11   0.15
0, 2, 3, 4      I         0.00   0.01   0.03   0.07   0.06   0.09
0, 2, 4         I         0.00   0.00   0.00   0.00   0.02   0.05
0, 1, 2, 4      I         0.00   0.00   0.00   0.02   0.04   0.04

$\beta = (2, 9, 6, 4, 8)'$
                          $\sigma_v^2$
Model           Category  0      1      2      5      10     16
0, 1, 2, 3, 4   Optimal   1.00   0.98   0.90   0.60   0.29   0.11
0, 2, 3, 4      I         0.00   0.02   0.07   0.24   0.32   0.28
0, 1, 3, 4      I         0.00   0.00   0.02   0.11   0.18   0.23
0, 1, 2, 4      I         0.00   0.00   0.01   0.06   0.13   0.17
0, 3, 4         I         0.00   0.00   0.00   0.00   0.03   0.09
0, 2, 4         I         0.00   0.00   0.00   0.00   0.03   0.10
0, 1, 4         I         0.00   0.00   0.00   0.00   0.01   0.03
0, 1, 3         I         0.00   0.00   0.00   0.00   0.00   0.00


The $C_p$ criterion also did not perform well for larger
models with large values of $\sigma_v^2$. The $C_p$ criterion, however,
did very well for large models with small values of $\sigma_v^2$. For
the full model $\beta' = (2, 9, 6, 4, 8)$ with $\sigma_v^2 = 1$, the estimated
selection probabilities were: optimal model, 0.98;
incorrect model, 0.02. In contrast, when $\sigma_v^2 = 16$,
the estimated selection probabilities were: optimal model,
0.11; incorrect model, 0.89. Note that in this scenario there
are no correct models other than the optimal model.

In summary, when the $C_p$ criterion is applied to data
following the nested error regression model:


1. For any particular model, the estimated probability
of selecting the optimal model decreases as $\sigma_v^2$
increases.


2. For any particular model, the estimated probability
of selecting an incorrect model increases as $\sigma_v^2$
increases.


3. As the number of variables included in the model
increases and $\sigma_v^2$ increases, the estimated probability
of selecting the optimal model decreases.


4. As the number of variables included in the model
increases and $\sigma_v^2$ increases, the estimated
probability of selecting an incorrect model
increases.
The data were then used to estimate the probability of
selecting each model using the $C_p$ criterion under the
transformation for $\rho$ known. The results of the simulation
are given in Table 3. For the model $\beta' = (2, 0, 0, 4, 0)$ with
$\sigma_v^2 = 0$ (standard regression model) the estimated selection
probabilities were: optimal model, 0.62; correct model,
0.38; incorrect model, 0 (Table 2). Similarly, under the
transformation for $\rho$ known with $\sigma_v^2 = 16$, the estimated
selection probabilities were: optimal model, 0.60; correct
model, 0.40; incorrect model, 0 (Table 3). For the full model
$\beta' = (2, 9, 6, 4, 8)$, the estimated probability of selecting the
optimal model was 1 for both the standard regression model
(Table 2, $\sigma_v^2 = 0$) and under the transformation for $\rho$
known for all values of $\sigma_v^2$ considered (Table 3).

In practice, $\rho$ is unknown and must be estimated from
the data. The transformation for $\rho$ unknown is therefore
more helpful for practitioners. The results for the transformation
with $\rho$ unknown are displayed in Table 4. When
$\rho$ was estimated, there was only a small decrease in the
estimated probability of selecting the optimal model or a
correct model. The largest decrease in the estimated
probability of selecting the optimal model was 0.03, for the
model with $\beta' = (2, 0, 0, 4, 0)$ and $\sigma_v^2 = 1$: 0.61 for $\rho$ known
(Table 3) compared to 0.58 for $\rho$ unknown (Table 4).


Table 3
Probabilities of Model Selection After Transformation, $\rho$ Known

$\beta = (2, 0, 0, 4, 0)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 3            Optimal   0.61   0.60   0.61   0.61   0.60
0, 3, 4         II        0.11   0.10   0.11   0.11   0.11
0, 2, 3         II        0.10   0.11   0.11   0.10   0.11
0, 1, 3         II        0.09   0.10   0.08   0.09   0.09
0, 1, 2, 3      II        0.04   0.04   0.04   0.04   0.04
0, 1, 3, 4      II        0.03   0.03   0.03   0.02   0.02
0, 2, 3, 4      II        0.02   0.02   0.02   0.02   0.02
0, 1, 2, 3, 4   II        0.01   0.01   0.01   0.01   0.01

$\beta = (2, 0, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 3, 4         Optimal   0.71   0.71   0.73   0.72   0.71
0, 2, 3, 4      II        0.13   0.12   0.11   0.12   0.13
0, 1, 3, 4      II        0.11   0.12   0.10   0.11   0.11
0, 1, 2, 3, 4   II        0.05   0.05   0.05   0.05   0.05

$\beta = (2, 9, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 1, 3, 4      Optimal   0.82   0.83   0.83   0.82   0.83
0, 1, 2, 3, 4   II        0.18   0.17   0.17   0.18   0.17

$\beta = (2, 9, 6, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 1, 2, 3, 4   Optimal   1.00   1.00   1.00   1.00   1.00

Table 4
Probabilities of Model Selection After Transformation, $\rho$ Unknown

$\beta = (2, 0, 0, 4, 0)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 3            Optimal   0.58   0.59   0.60   0.61   0.60
0, 3, 4         II        0.11   0.10   0.11   0.10   0.10
0, 2, 3         II        0.11   0.10   0.11   0.11   0.11
0, 1, 3         II        0.08   0.09   0.10   0.09   0.09
0, 1, 2, 3      II        0.04   0.04   0.03   0.04   0.04
0, 1, 3, 4      II        0.03   0.03   0.02   0.02   0.02
0, 2, 3, 4      II        0.03   0.03   0.02   0.02   0.03
0, 1, 2, 3, 4   II        0.02   0.02   0.01   0.01   0.01

$\beta = (2, 0, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 3, 4         Optimal   0.70   0.70   0.70   0.71   0.70
0, 2, 3, 4      II        0.13   0.14   0.13   0.13   0.13
0, 1, 3, 4      II        0.13   0.11   0.12   0.11   0.12
0, 1, 2, 3, 4   II        0.04   0.05   0.05   0.05   0.05

$\beta = (2, 9, 0, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 1, 3, 4      Optimal   0.82   0.82   0.81   0.83   0.83
0, 1, 2, 3, 4   II        0.18   0.18   0.19   0.17   0.17

$\beta = (2, 9, 6, 4, 8)'$
                          $\sigma_v^2$
Model           Category  1      2      5      10     16
0, 1, 2, 3, 4   Optimal   1.00   1.00   1.00   1.00   1.00

Based on our simulation results, when the $C_p$ criterion is
applied to data following the nested error regression model:


1. Under both transformations ($\rho$ known and $\rho$
unknown), the estimated probability of selecting an
incorrect model was 0.


2. Under the transformation for $\rho$ known, the
probability of selecting the optimal model was
similar to that of the standard regression model.


3. When $\rho$ was estimated, there was only a small
decrease in the estimated probability of selecting
the optimal model or a correct model.


4. Under both transformations ($\rho$ known and $\rho$
estimated), the $C_p$ criterion performed well, even
for larger models with large values of $\sigma_v^2$.


5. The performance of the $C_p$ criterion for the nested
error regression model resembles that of the $C_p$
criterion for the standard regression model.


In summary, the $C_p$ criterion does not perform well
under the nested error regression model when $\sigma_v^2$ is large.
When the transformation for $\rho$ unknown (or $\rho$ known) is
applied, the model then becomes a standard regression
model and the $C_p$ statistic performs accordingly.

Statistics Canada, Catalogue No. 12-001-XPB 


Survey Methodology, June 2005 




Acknowledgements


The research was supported in part by a grant from the
Gallup Organization.
GUIDELINES FOR MANUSCRIPTS


Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 19, No. 1 and 
onward) as a guide and note particularly the points below. Articles must be submitted in machine-readable form, preferably 
in Word. A paper copy may be required for formulas and figures. 


Layout 


Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double
spaced with margins of at least 1½ inches on all sides.

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 
Style


Avoid footnotes, abbreviations, and acronyms.

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as “exp(·)”
and “log(·)”, etc.

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are
to be referred to later.

Write fractions in the text using a solidus.

Distinguish between ambiguous characters (e.g., w, ω; o, O, 0; l, 1).

Italics are used for emphasis. Indicate italics by underlining on the manuscript. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly
self-explanatory as possible, at the bottom for figures and at the top for tables.

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they 
should appear near where they are first referred to). 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, page 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 
In This Issue


It is with great sadness that we note the recent passing of M.P. Singh, Editor of the Survey Methodology 
journal since the very first issue in 1975. This issue of the journal opens with a brief obituary in memoriam. 

This issue of Survey Methodology also contains the fifth paper in the annual invited paper series in 
honour of Joseph Waksberg. A short biography of Joseph Waksberg was given in the June 2001 issue 
of the journal, along with the first paper in the series. I would like to thank the members of the
selection committee (Michael Brick, chair, David Bellhouse, Gordon Brackstone and Paul Biemer)
for having selected Jon Rao as the author of this year’s Waksberg paper.

In his paper entitled “Interplay Between Sample Survey Theory and Practice: An Appraisal”, Rao 
traces how survey methods are stimulated by new theoretical developments, and how theory is 
challenged by survey practice. After summarizing fifty years of contributions from 1920 to 1970, he 
presents more detailed discussions of more recent developments in several areas. Finally, he discusses 
several examples of important theory that is not yet widely applied in practice. 

In their paper, Fuller and Kim develop and study an efficient hot-deck imputation method under the 
assumption that response probabilities are equal within imputation cells. Their proposed method is 
based on the idea of fractional imputation and uses regression techniques to obtain an approximation 
of the fully efficient version of fractional imputation. Variance estimation is developed for replication 
methods. Their proposed method is shown to work well in a simulation study. 

The paper by Brick, Jones, Kalton and Valliant compares through a simulation study three variance 
estimation methods in the presence of hot-deck imputation: the model-assisted method, the adjusted 
jackknife method and multiple imputation. The goal of the simulation study is to study the properties 
of these variance estimators when their underlying assumptions do not hold. They found that the 
coverage rate of confidence intervals is not close to the nominal level when the point estimates are 
biased due to failure to take into account the domains of interest at the imputation stage. They conclude
by noting that the differences between the variance estimators were too small and inconsistent to 
support claims that any one of them is superior in general. 

Little and Vartivarian study the effect of nonresponse weighting on the Mean Squared Error (MSE) 
of a population mean estimator. Nonresponse weighting adjustments are obtained by adjusting design 
weights by the inverse of response rates within cells. They come to the conclusion that a covariate 
must have two characteristics to reduce nonresponse bias: it needs to be related to both the probability 
of response and to the survey outcome. If the latter is true, nonresponse weighting can also reduce 
nonresponse variance. Estimates of the MSE are proposed and used to define a composite estimator. 
This composite estimator worked well when evaluated in a simulation study. 

O’Malley and Zaslavsky present generalized variance-covariance modeling functions (GVCFs) for
multivariate means of ordinal survey items, for both complete data and data with structured non- 
response. After developing and evaluating their methods, they give an illustration using data from the 
Consumer Assessments of Health Plans Study. In the concluding section they discuss some issues 
related to the application of GVCFs. 

The paper by Singh, Shukla and Kundu develops spatial and spatial-temporal models for small area 
estimation, as well as estimation of the MSE of the resulting EBLUPs. The models are applied to 
monthly per capita consumption expenditure data, and they conclude that the models can be very 
effective when there are significant correlations due to neighborhood effects. 

Belsby, Bjørnstad and Zhang discuss modeling to estimate the number of households of different
sizes when there is nonignorable nonresponse. They model the response mechanism conditional on 
household size, using registered family size as supplementary data. After developing their modeling 
approach, they produce and evaluate estimates using data from the 1992 Norwegian Consumer 
Expenditure Survey. 

Nandram, Cox and Choi consider an analysis for categorical data from a single two-way table with 
both item and unit nonresponse or, in their terminology, partial classification. They propose to use a 
Bayesian approach for modeling different patterns of missingness under ignorability and non- 
ignorability assumptions. The methods are illustrated using incompletely-observed bivariate data from 
the National Health and Nutrition Examination Survey where the variables subject to missingness are 
bone mineral density and family income. 




In the first of three short notes in this issue, Beaumont discusses the use of data collection process 
information in nonresponse weight adjustment. He then presents an example from the Canadian 
Labour Force Survey using the number of attempts to contact a survey unit. An important result is that 
if the collection process information can be treated as random, then this approach does not introduce 
any bias. 

Starting from basic principles, Bustos derives an explicit form for the probability function of an 
ordered sample. He then shows how this function can be used to compute inclusion probabilities, 
with illustrations for common sample designs. Finally, he gives the general form for the correlation 
matrix of sample units, which depends solely on the inclusion probabilities. 

Finally, the paper by Wu briefly reviews some theory about the Pseudo Empirical Likelihood (PEL) 
method in survey sampling, and presents algorithms for computing maximum PEL estimators and for 
constructing PEL ratio confidence intervals. Functions using the statistical software R and S-PLUS are 
given to help implement these algorithms in real surveys or in simulation studies. 


Harold Mantel 

In Memoriam 


M.P. Singh 
(1941-2005) 


Dr. Mangala P. Singh was born in India on December 
16, 1941 and received his PhD in 1969 from the Indian 
Statistical Institute, with a specialization in survey sampling. 
He joined Statistics Canada in 1970, where he rose to the 
position of Director of Household Survey Methods Division 
in 1994, a position he held at his death in August 2005. 

M.P., as he was known to everyone, was a leading figure 
in the application of statistical methods at Statistics Canada. 
He was probably most closely associated with the Labour 
Force Survey, one of the agency’s most important surveys. 
He directed the methodology of the LFS through redesigns 
in the 1970s, 1980s, 1990s and early 
21st century, introducing innovations at 
every turn, but always ensuring that 
changes were well-tested and sound. 
In the later years of his career, he also 
oversaw the development of several 
new and innovative health surveys and 
directed the development of statistical 
programs in the areas of household 
expenditures, education and justice. 

M.P.’s role as the Editor-in-Chief of 
the journal Survey Methodology had a 
transformative effect on the profession 
of survey methodology, both in 
Canada and abroad. M.P. was the 
founding editor of the journal, and for 
30 years he guided its evolution into a 
flagship publication of Statistics Can- 
ada. Thanks to his ability to attract a 
stellar team of associate editors and contributors, Survey 
Methodology is now recognized as one of the pre-eminent 
journals of its kind in the world. Even in recent years, M.P. 
continued to introduce innovations such as the Waksberg 
series of papers and electronic publishing. 

M.P. was a source of many other “big ideas” throughout 
his career at Statistics Canada. During the 1970s he was 
instrumental in gaining support for the idea of stable 
funding for methodology research, and he personally 
chaired the Methodology Research and Development 


Committee in its formative years. He encouraged numerous 
researchers and went out of his way to make them feel at 
home at Statistics Canada. Turning 60 did not stem the flow 
of ideas in any way. M.P. devoted considerable energy in 
the past four years to his proposal for a major overhaul of 
the way household surveys are conducted in Canada. As a 
result of his efforts, people throughout Statistics Canada are 
working on ways to implement his vision, and his influence 
on Canada’s household surveys will be felt for many years. 

M.P. had a special love for statistical research and for 
statistics as a profession. He personally authored over 40 
papers in international journals, co- 
edited two books published by Wiley 
and Sons, and organized sessions and 
presented papers at numerous statis- 
tical conferences. He served on vari- 
ous committees and task forces of the 
Statistical Society of Canada, the 
International Statistical Institute and 
the American Statistical Association. 
He also served as Secretary of Sta- 
tistics Canada’s external Advisory 
Committee on Statistical Methods. In 
turn, the profession honoured him; he 
was elected to the International Sta- 
tistical Institute in 1975, and in 1988 
he became a Fellow of the American 
Statistical Association. 

However it is his influence on an 
entire generation of statisticians that 
may be his greatest legacy. He was a mentor, a coach, a 
patriarch and a friend to all who knew him. He inspired 
others to give their best, and they did. He was always ready 
with a laugh, a smile and a friendly word of encouragement. 
He dedicated his life to the profession of statistics and it is 
through those whom he touched that his true contribution is 
measured. 

He is survived by his wife Savitri, his two daughters Mala 
and Mamta, and his son Rahul. 

Waksberg Invited Paper Series 


The journal Survey Methodology has established an annual invited paper series in honour of Joseph 
Waksberg, who has made many important contributions to survey methodology. Each year a prominent 
survey researcher is chosen to author an article as part of the Waksberg Invited Paper Series. The paper 
reviews the development and current state of a significant topic within the field of survey methodology, and 
reflects the mixture of theory and practice that characterized Waksberg’s work. The author receives a cash 
award made possible by a grant from Westat, in recognition of Joe Waksberg’s contributions during his 
many years of association with Westat. The grant is administered financially by the American Statistical 
Association. Previous winners were Gad Nathan, Wayne Fuller, Tim Holt, Norman Bradburn, Jon Rao, and 
Alastair Scott. The first five papers in the series have already appeared in Survey Methodology. 


Previous Waksberg Award Winners: 


Gad Nathan (2001) 
Wayne A. Fuller (2002) 
Tim Holt (2003) 

Norman Bradburn (2004) 
J.N.K. Rao (2005) 


Nominations: 


The author of the 2007 Waksberg paper will be selected by a four-person committee appointed by Survey 
Methodology and the American Statistical Association. Nominations of individuals to be considered as 
authors or suggestions for topics should be sent to the chair of the committee, Gordon Brackstone, 78 
Charing Road, Ottawa, Ontario, Canada, K2G 4C9, by email to Gordon.brackstone@sympatico.ca or by fax 
1-613-951-1394. Nominations and suggestions for topics must be received by February 28, 2006. 


2005 Waksberg Invited Paper 
Author: J.N.K. Rao 


J.N.K. Rao is Distinguished Research Professor at Carleton University, Ottawa. He has published many 
articles on a wide range of topics in survey sampling theory and methods and he is the author of the 2003 
Wiley book “Small Area Estimation”. His research interests in survey sampling include analysis of survey 
data, small area estimation, missing data and imputation, re-sampling methods and empirical likelihood 
inference. His 1981 JASA paper (with A.J. Scott) on analysis of survey data was selected as a landmark 
paper in survey sampling theory and methods. He has been a Member of the Advisory Committee on 
Statistical Methods of Statistics Canada since its inception 20 years ago. He is a Fellow of the Royal 
Society of Canada and received the 1994 Gold Medal of the Statistical Society of Canada. 


Members of the Waksberg Paper Selection Committee (2005-2006) 


Gordon Brackstone (Chair) 
Wayne Fuller, Iowa State University 
Sharon Lohr, Arizona State University 


Past Chairs: 


Graham Kalton (1999 - 2001) 
Chris Skinner (2001 - 2002) 
David A. Binder (2002 - 2003) 

J. Michael Brick (2003 - 2004) 
David R. Bellhouse (2004 - 2005) 

Interplay Between Sample Survey Theory and Practice: An Appraisal 


J.N.K. Rao ¹ 


Abstract 


A large part of sample survey theory has been directly motivated by practical problems encountered in the design and 
analysis of sample surveys. On the other hand, sample survey theory has influenced practice, often leading to significant 
improvements. This paper will examine this interplay over the past 60 years or so. Examples where new theory is needed or 
where theory exists but is not used will also be presented. 

Key Words: Analysis of survey data; Early contributions; Inferential issues; Re-sampling methods; Small area estimation. 


1. Introduction 


In this paper I will examine the interplay between sample 
survey theory and practice over the past 60 years or so. I 
will cover a wide range of topics: early landmark contri- 
butions that have greatly influenced practice, inferential 
issues, calibration estimation that ensures consistency with 
user specified totals of auxiliary variables, unequal proba- 
bility sampling without replacement, analysis of survey 
data, the role of resampling methods, and small area esti- 
mation. I will also present some examples where new theory 
is needed or where theory exists but is not used widely. 


2. Some Early Landmark Contributions: 
1920-1970 


This section gives an account of some early landmark 
contributions to sample survey theory and methods that 
have greatly influenced the practice. The Norwegian statistician 
A.N. Kiaer (1897) is perhaps the first to promote sampling 
(or what was then called “the representative method”) 
over complete enumeration, although the oldest reference to 
sampling can be traced back to the great Indian epic 
Mahabharata (Hacking 1975, page 7). In the representative 
method the sample should mirror the parent finite 
population and this may be achieved either by balanced 
sampling through purposive selection or by random sam- 
pling. The representative method was used in Russia as 
early as 1900 (Zarkovic 1956) and Wright conducted 
sample surveys in the United States around the same period 
using this method. By the 1920s, the representative method 
was widely used, and the International Statistical Institute 
played a prominent role by creating a committee in 1924 to 
report on the representative method. This committee’s re- 
port discussed theoretical and practical aspects of the ran- 
dom sampling method. Bowley’s (1926) contribution to this 
report includes his fundamental work on stratified random 


sampling with proportional allocation, leading to a represen- 
tative sample with equal inclusion probabilities. Hubback 
(1927) recognized the need for random sampling in crop 
surveys: “The only way in which a satisfactory estimate can 
be found is by as close an approximation to random 
sampling as the circumstances permit, since that not only 
gets rid of the personal limitations of the experimenter but 
also makes it possible to say what is the probability with 
which the results of a given number of samples will be 
within a given range from the mean. To put this into definite 
language, it should be possible to find out how many 
samples will be required to secure that the odds are at least 
20:1 on the mean of the samples within one maund of the 
true mean”. This statement contains two important obser- 
vations on random sampling: (1). It avoids personal biases 
in sample selection. (2). Sample size can be determined to 
satisfy a specified margin of error apart from a chance of 1 
in 20. Mahalanobis (1946b) remarked that R.A. Fisher’s 
fundamental work at Rothamsted Experimental Station on 
design of experiments was influenced directly by Hubback 
(1927). 

1. J.N.K. Rao, School of Mathematics and Statistics, Carleton University, Ottawa, Ontario, Canada, K1S 5B6. 

Neyman’s (1934) classic landmark paper laid the theoretical 
foundations of the probability sampling (or design-based) 
approach to inference from survey samples. He 
showed, both theoretically and with practical examples, that 
stratified random sampling is preferable to balanced sampling 
because the latter can perform poorly if the underlying 
model assumptions are violated. Neyman also introduced 
the ideas of efficiency and optimal allocation in his theory 
of stratified random sampling without replacement by 
relaxing the condition of equal inclusion probabilities. By 
generalizing the Markov theorem on least squares estimation, 
Neyman proved that the stratified mean, 
$\bar{y}_{st} = \sum_h W_h \bar{y}_h$, is the best estimator of the population mean, 
$\bar{Y} = \sum_h W_h \bar{Y}_h$, in the linear class of unbiased estimators of 
the form $\bar{y}_b = \sum_h W_h \sum_i b_{hi} y_{hi}$, where $W_h$, $\bar{y}_h$ and $\bar{Y}_h$ are 
the $h$th stratum weight, sample mean and population mean 
($h = 1, \ldots, L$), and $b_{hi}$ is a constant associated with the 
item value $y_{hi}$ observed on the $i$th sample draw 
($i = 1, \ldots, n_h$) in the $h$th stratum. Optimal allocation 
$(n_1, \ldots, n_L)$ of the total sample size, $n$, was obtained by 
minimizing the variance of $\bar{y}_{st}$ subject to $\sum_h n_h = n$; an earlier 
proof of Neyman allocation by Tschuprow (1923) was later 
discovered. Neyman also proposed inference from larger 
samples based on normal theory confidence intervals such 
that the frequency of errors in the confidence statements 
based on all possible stratified random samples that could be 
drawn does not exceed the limit prescribed in advance 
“whatever the unknown properties of the population”. Any 
method of sampling that satisfies the above frequency 
statement was called “representative”. Note that Hubback 
(1927) earlier alluded to the frequency statement associated 
with the confidence interval. Neyman’s final contribution to 
the theory of sample surveys (Neyman 1938) studied two-phase 
sampling for stratification and derived the optimal 
first phase and second phase sample sizes, $n'$ and $n$, by 
minimizing the variance of the estimator subject to a given 
cost $C = n'c' + nc$, where the second phase cost per unit, 
$c$, is large relative to the first phase cost per unit, $c'$. 
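
The optimal allocation discussed above can be illustrated with a minimal Python sketch. It assumes the textbook form of Neyman allocation, with stratum sample sizes proportional to N_h S_h (stratum size times stratum standard deviation), and simple random sampling without replacement within strata. The function names and the numerical values are hypothetical and are not taken from Neyman (1934).

```python
def neyman_allocation(n, N_h, S_h):
    """Split the total sample size n across strata in proportion to N_h * S_h."""
    products = [N * S for N, S in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]


def stratified_mean_variance(n_h, N_h, S_h):
    """Variance of the stratified mean under SRS without replacement in each stratum."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
               for nh, Nh, Sh in zip(n_h, N_h, S_h))


# Hypothetical population: three strata with increasing variability.
N_h = [500, 300, 200]
S_h = [2.0, 5.0, 10.0]
n = 100

neyman = neyman_allocation(n, N_h, S_h)
proportional = [n * Nh / sum(N_h) for Nh in N_h]

var_neyman = stratified_mean_variance(neyman, N_h, S_h)
var_proportional = stratified_mean_variance(proportional, N_h, S_h)
print(neyman, var_neyman, var_proportional)
```

In this illustration the more variable strata receive larger samples than under proportional allocation, and the variance of the stratified mean is smaller. Allocations are left as real numbers; rounding to integers is a separate practical step.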

The 1930’s saw a rapid growth in demand for informa- 
tion, and the advantages of probability sampling in terms of 
greater scope, reduced cost, greater speed and model-free 
features were soon recognized, leading to an increase in the 
number and type of surveys taken by probability sampling 
and covering large populations. Neyman’s approach was 
almost universally accepted by practicing survey statis- 
ticians. Moreover, it inspired various important extensions, 
mostly motivated by practical and efficiency considerations. 
Cochran’s (1939) landmark paper contains several impor- 
tant results: the use of ANOVA to estimate the gain in effi- 
ciency due to stratification, estimation of variance compo- 
nents in two-stage sampling for future studies on similar 
material, choice of sampling unit, regression estimation 
under two-phase sampling and effect of errors in strata sizes. 
This paper also introduced the super-population concept: 
“The finite population should itself be regarded as a random 
sample from some infinite population”. It is interesting to 
note that Cochran at that time was critical of the traditional 
fixed population concept: “Further, it is far removed from 
reality to regard the population as a fixed batch of known 
numbers”. Cochran (1940) introduced ratio estimation for 
sample surveys, although an early use of the ratio estimator 
dates back to Laplace (1820). In another landmark paper 
(Cochran 1942), he developed the theory of regression 
estimation. He derived the conditional variance of the usual 
regression estimator for a fixed sample and also a sample 
estimator of this variance, assuming a linear regression 
model $y = \alpha + \beta x + e$, where $e$ has mean zero and 
constant variance in arrays in which $x$ is fixed. He also 
noted that the regression estimator remains (model) unbi- 
ased under non-random sampling, provided the assumed 
linear regression model is correct. He derived the average 
bias under model deviations (in particular, quadratic regres- 
sion) for simple random sampling as the sample size n 
increased. Cochran then extended his results to weighted 
regression and derived the now well-known optimality 
result for the ratio estimator, namely it is a “best unbiased 
linear estimate if the mean value and variance both change 
proportional to x”. The latter model is called the ratio mod- 
el in the current literature. Madow and Madow (1944) and 
Cochran (1946) compared the expected (or anticipated) 
variance under a super-population model to study the 
relative efficiency of systematic sampling and stratified 
random sampling analytically. This paper stimulated much 
subsequent research on the use of super-population models 
in the choice of probability sampling strategies, and also for 
model-dependent and model-assisted inferences (see section 
3). 
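
As a hedged illustration of the ratio and regression estimators discussed in this paragraph, the sketch below uses their usual textbook forms under simple random sampling: the ratio estimator of a total, (sum of y / sum of x) times the known population total of x, and the regression estimator of a mean, ybar + b (Xbar - xbar). The function names and the small data sets are hypothetical.

```python
def ratio_estimate_total(sample_y, sample_x, x_pop_total):
    """Ratio estimator of the population total of y, given the known total of x."""
    return sum(sample_y) / sum(sample_x) * x_pop_total


def regression_estimate_mean(sample_y, sample_x, x_pop_mean):
    """Regression estimator of the population mean of y, given the known mean of x."""
    n = len(sample_y)
    x_bar = sum(sample_x) / n
    y_bar = sum(sample_y) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sample_x, sample_y))
    s_xx = sum((x - x_bar) ** 2 for x in sample_x)
    b = s_xy / s_xx  # least squares slope from the sample
    return y_bar + b * (x_pop_mean - x_bar)


# If y is exactly proportional to x, the ratio estimator recovers the total
# from any sample; here y = 2x and the population total of x is 10.
ratio_total = ratio_estimate_total([2.0, 6.0], [1.0, 3.0], 10.0)

# If y is exactly linear in x, the regression estimator recovers the mean;
# here y = 1 + 2x and the population mean of x is 4.
reg_mean = regression_estimate_mean([3.0, 7.0, 11.0], [1.0, 3.0, 5.0], 4.0)
print(ratio_total, reg_mean)
```

The two toy cases correspond to the situations in which each estimator is exactly right: a relationship through the origin for the ratio estimator and a linear relationship with intercept for the regression estimator.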

In India, Mahalanobis made pioneering contributions to 
sampling by formulating cost and variance functions for the 
design of surveys. His 1944 landmark paper (Mahalanobis 
1944) provides deep theoretical results on the efficient de- 
sign of sample surveys and their practical applications, in 
particular to crop acreage and yield surveys. The well- 
known optimal allocation in stratified random sampling 
with cost per unit varying across strata is obtained as a 
special case of his general theory. As early as 1937, 
Mahalanobis used multi-stage designs for crop yield surveys 
with villages, grids within villages, plots within grids and 
cuts of different sizes and shapes as sampling units in the 
four stages of sampling (Murthy 1964). He also used a two- 
phase sampling design for estimating the yield of cinchona 
bark. He was instrumental in establishing the National 
Sample Survey (NSS) of India, the largest multi-subject 
continuing survey operation with full-time staff using 
personal interviews for socioeconomic surveys and physical 
measurements for crop surveys. Several prominent survey 
statisticians, including D.B. Lahiri and M.N. Murthy, were 
associated with the NSS. 

P.V. Sukhatme, who studied under Neyman, also made 
pioneering contributions to the design and analysis of large- 
scale agricultural surveys in India, using stratified multi- 
stage sampling. Starting in 1942-1943 he developed 
efficient designs for the conduct of nationwide surveys on 
wheat and rice crops and demonstrated a high degree of 
precision for state estimates and a reasonable margin of error 
for district estimates. Sukhatme’s approach differed from 
that of Mahalanobis, who used very small plots for crop 
cutting, employing an ad hoc staff of investigators. Sukhatme 
(1947) and Sukhatme and Panse (1951) demonstrated that 
the use of a small plot might give biased estimates due to the 
tendency of placing boundary plants inside the plot when 
there is doubt. They also pointed out that the use of an 
ad hoc staff of investigators, moving rapidly from place to 
place, forces the plot measurements on only those sample 
fields that are ready for harvest on the date of the visit, thus 
violating the principle of random sampling. Sukhatme’s 
solution was to use large plots to avoid boundary bias and to 
entrust crop-cutting work to the local revenue or agricultural 
agency in a State. 

Survey statisticians at the U.S. Census Bureau, under the 
leadership of Morris Hansen, William Hurwitz, William 
Madow and Joseph Waksberg, made fundamental contribu- 
tions to sample survey theory and practice during the period 
1940-70, and many of those methods are still widely used 
in practice. Hansen and Hurwitz (1943) developed the basic 
theory of stratified two-stage sampling with one primary 
sampling unit (PSU) within each stratum drawn with proba- 
bility proportional to size measure (PPS sampling) and then 
sub-sampled at a rate that ensures self-weighting (equal 
overall probabilities of selection) within strata. This ap- 
proach provides approximately equal interviewer work 
loads which is desirable in terms of field operations. It also 
leads to significant variance reduction by controlling the 
variability arising from unequal PSU sizes without actually 
stratifying by size and thus allowing stratification on other 
variables to reduce the variance. On the other hand, 
workloads can vary widely if the PSUs are selected by 
simple random sampling and then sub-sampled at the same 
rate within each stratum. PPS sampling of PSUs is now 
widely used in the design of large-scale surveys, but two or 
more PSUs are selected without replacement from each 
stratum such that the PSU inclusion probabilities are 
proportional to size measures (see section 5). 
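
The self-weighting property of the Hansen and Hurwitz design can be checked with a small Python sketch. It assumes one PSU drawn per stratum with probability proportional to its size measure, and a within-PSU sampling rate chosen so that the product of the two probabilities equals a fixed overall rate. The function name, PSU sizes and overall rate are hypothetical.

```python
from fractions import Fraction


def pps_self_weighting(sizes, overall_rate):
    """For each PSU return (selection probability, within-PSU rate, expected take)."""
    total = sum(sizes)
    design = []
    for M in sizes:
        p_psu = Fraction(M, total)            # PPS selection probability of the PSU
        within_rate = overall_rate / p_psu    # rate that makes the design self-weighting
        design.append((p_psu, within_rate, within_rate * M))
    return design


# Hypothetical stratum with three PSUs of unequal size and overall rate 1/50.
sizes = [200, 500, 300]
f = Fraction(1, 50)
design = pps_self_weighting(sizes, f)
for p_psu, within_rate, take in design:
    print(p_psu, within_rate, take)
```

Whichever PSU is selected, the overall selection probability of an ultimate unit is 1/50 and the expected number of sampled units is the same, which is the equal-workload feature noted in the text. With simple random sampling of PSUs and a common within-PSU rate, the take would instead be proportional to PSU size.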

Many large-scale surveys are repeated over time, such as 
the monthly Canadian Labour Force Survey (LFS) and the 
U.S. Current Population Survey (CPS), with partial replace- 
ment of ultimate units (also called rotation sampling). For 
example, in the LFS the sample of households is divided 
into six rotation groups (panels) and a rotation group re- 
mains in the sample for six consecutive months and then 
drops out of the sample, thus giving a five-sixths overlap between 
two consecutive months. Yates (1949) and Patterson 
(1950), following the initial work of Jessen (1942) for 
sampling on two occasions with partial replacement of 
units, provided the theoretical foundations for design and 
estimation of repeated surveys, and demonstrated the effi- 
ciency gains for level and change estimation by taking 
advantage of past data. Hansen, Hurwitz, Nisselson and 
Steinberg (1955) developed simpler estimators, called K-composite 
estimators, in the context of stratified multi-stage 
designs with PPS sampling in the first stage. Rao and 
Graham (1964) studied optimal replacement policies for the 
K-composite estimators. Various extensions have also 
been proposed. Composite estimators have been used in the 
CPS and other continuing large scale surveys. Only re- 
cently, the Canadian LFS adopted a type of composite esti- 
mation, called regression composite estimation, that makes 
use of sample information from previous months and that 
can be implemented with a regression weights program (see 
section 4). 

Keyfitz (1951) proposed an ingenious method of 
switching to better PSU size measures in continuing surveys 
based on the latest census counts. His method ensures that 
the probability of overlap with the previous sample of one 
PSU per stratum is maximized, thus reducing the field costs 
and at the same time achieving increased efficiency by using 
the better size measures in PPS sampling. The Canadian 
LFS and other continuing surveys have used the Keyfitz 
method. Raj (1956) formulated the optimization problem as 
a “transportation problem” in linear programming. Kish and 
Scott (1971) extended the Keyfitz method to changing strata 
and size measures. Ernst (1999) has given a nice account of 
the developments over the past 50 years in sample co- 
ordination (maximizing or minimizing the sample overlap) 
using transportation algorithms and related methods; see 
also Mach, Reiss and Schiopu-Kratina (2005) for appli- 
cations to business surveys with births and deaths of firms. 

Dalenius (1957, Chapter 7) studied the problem of opti- 
mal stratification for a given number of strata, L, under the 
Neyman allocation. Dalenius and Hodges (1959) obtained a 
simple approximation to optimal stratification, called the 
cum $\sqrt{f}$ rule, which is extensively used in practice. For 
highly skewed populations with a small number of units 
accounting for a large share of the total $Y$, such as business 
populations, efficient stratification requires one take-all 
stratum ($n_h = N_h$) of big units and take-some strata of 
medium and small size units. Lavallée and Hidiroglou 
(1988) and Rivest (2002) developed algorithms for determining 
the strata boundaries using power allocation (Fellegi 
1981; Bankier 1988) and Neyman allocation for the take-some 
strata. Statistics Canada and other agencies currently 
use those algorithms for business surveys. 
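
A minimal Python sketch of the cum $\sqrt{f}$ rule follows. It assumes the common textbook implementation: group the stratification variable into narrow equal-width classes, cumulate the square roots of the class frequencies, and place stratum boundaries at equal intervals on that cumulative scale. The function name, the number of classes and the skewed test data are hypothetical choices.

```python
import math


def cum_sqrt_f_boundaries(values, n_bins, n_strata):
    """Approximate stratum boundaries from cumulated square roots of class frequencies."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # class index; maximum goes in last class
        counts[k] += 1

    cum = []
    running = 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    total = cum[-1]

    boundaries = []
    next_cut = 1
    for k, c in enumerate(cum):
        # Cut at the upper edge of the first class reaching the next equal share.
        if next_cut < n_strata and c >= next_cut * total / n_strata:
            boundaries.append(lo + (k + 1) * width)
            next_cut += 1
    return boundaries


# Hypothetical skewed population: squares of 1..100, three strata, 20 classes.
values = [i ** 2 for i in range(1, 101)]
boundaries = cum_sqrt_f_boundaries(values, n_bins=20, n_strata=3)
print(boundaries)
```

The result depends on the number of initial classes, so in practice the class width is usually chosen fairly fine relative to the number of strata.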

The focus of research prior to 1950 was on estimating 
population totals and means for the whole population and 
large planned sub-populations, such as states or provinces. 
However, users are also interested in totals and means for 
unplanned sub-populations (also called domains) such as 
age-sex groups within a province, and parameters other than 
totals and means such as the median and other quantiles, for 
example median income. Hartley (1959) developed a 
simple, unified theory for domain estimation applicable to 
any design, requiring only the standard formulae for the 
estimator of total and its variance estimator, denoted in the 
operator notation as $\hat{Y}(y)$ and $v(y)$ respectively. He 
introduced two synthetic variables ${}_j y_i$ and ${}_j a_i$ which take 
the values $y_i$ and 1 respectively if the unit $i$ belongs to 
domain $j$ and equal to 0 otherwise. The estimators of domain 
total ${}_j\hat{Y} = \hat{Y}({}_j y)$ and domain size ${}_j\hat{N} = \hat{Y}({}_j a)$ are 
then simply obtained from the formulae for $\hat{Y}(y)$ and 
$v(y)$ by replacing $y_i$ by ${}_j y_i$ and ${}_j a_i$ respectively. Similarly, 
estimators of domain means and domain differences 
and their variance estimators are obtained from the basic 
formulae for $\hat{Y}(y)$ and $v(y)$. Durbin (1968) also obtained 
similar results. Domain estimation is now routinely done 
using Hartley’s ingenious method. 
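
Hartley's device can be sketched in a few lines of Python. The example assumes simple random sampling without replacement, so that the standard formulae are the expansion estimator of a total and its usual variance estimator; any other design would simply swap in its own pair of formulae. The function names and data are hypothetical.

```python
def total_estimator(y, N):
    """Expansion estimator of a population total under SRS without replacement."""
    return N * sum(y) / len(y)


def variance_estimator(y, N):
    """Usual variance estimator of the expansion estimator under SRS without replacement."""
    n = len(y)
    mean = sum(y) / n
    s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
    return N ** 2 * (1 - n / N) * s2 / n


def domain_estimates(y, domain, j, N):
    """Apply the standard formulae to the synthetic variables for domain j."""
    jy = [v if d == j else 0.0 for v, d in zip(y, domain)]   # y_i inside domain j, else 0
    ja = [1.0 if d == j else 0.0 for d in domain]            # domain membership indicator
    return {
        "total": total_estimator(jy, N),
        "total_variance": variance_estimator(jy, N),
        "size": total_estimator(ja, N),
        "size_variance": variance_estimator(ja, N),
    }


# Hypothetical sample of four units from a population of N = 100.
y = [10.0, 20.0, 30.0, 40.0]
domain = ["a", "b", "a", "b"]
est_a = domain_estimates(y, domain, "a", N=100)
print(est_a)
```

No domain-specific formula is needed: the same two functions serve the whole population and every domain, which is the practical appeal of the approach.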

For inference on quantiles, Woodruff (1952) proposed a 
simple and ingenious method of getting a $(1-\alpha)$-level 
confidence interval under general sampling designs, using 
only the estimated distribution function and its standard 
error (see Lohr’s (1999) book, pages 311-313). Note that 
the latter are simply obtained from the formulae for a total 
by changing $y$ to an indicator variable. By equating the 
Woodruff interval to a normal theory interval on the 
quantile, a simple formula for the standard error of the $p$th 
quantile estimator may also be obtained as half the length of 
the interval divided by the upper $\alpha/2$-point of the 
standard $N(0, 1)$ distribution, which equals 1.96 if $\alpha = 0.05$ 
(Rao and Wu 1987; Francisco and Fuller 1991). A surprising 
property of the Woodruff interval is that it performs 
well even when $p$ is small or large and sample size is 
moderate (Sitter and Wu 2001). 
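
The Woodruff construction can be sketched in Python as follows. For simplicity the sketch assumes simple random sampling with a negligible sampling fraction, so that the standard error of the estimated distribution function at the quantile is approximated by sqrt(p(1-p)/n); under a complex design this quantity would come from the design-based variance formula for a total. The function names and data are hypothetical.

```python
import math
from statistics import NormalDist


def inverse_edf(sorted_y, prob):
    """Smallest observed value whose empirical distribution function is >= prob."""
    n = len(sorted_y)
    k = max(1, min(n, math.ceil(prob * n)))
    return sorted_y[k - 1]


def woodruff_interval(y, p, alpha=0.05):
    """Quantile estimate, Woodruff interval limits and implied standard error."""
    ys = sorted(y)
    n = len(ys)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    se_F = math.sqrt(p * (1 - p) / n)         # SRS approximation (an assumption)
    lower = inverse_edf(ys, p - z * se_F)     # invert the interval for F at p
    upper = inverse_edf(ys, p + z * se_F)
    estimate = inverse_edf(ys, p)
    se_quantile = (upper - lower) / (2 * z)   # half the interval length divided by z
    return estimate, lower, upper, se_quantile


# Hypothetical sample: the integers 1..100, median with a 95% interval.
estimate, lower, upper, se_q = woodruff_interval(list(range(1, 101)), p=0.5)
print(estimate, lower, upper, se_q)
```

The interval is built on the probability scale and then mapped back through the estimated distribution function, so no density estimate is required.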

The importance of measurement errors was realized as 
early as the 1940s. Mahalanobis’ (1946a) influential paper 
developed the technique of interpenetrating sub-samples 
(called replicated sampling by Deming 1960). This method 
was extensively used in large-scale sample surveys in India 
for assessing both sampling and measurement errors. The 
sample is drawn in the form of two or more independent 
sub-samples according to the same sampling design such 
that each sub-sample provides a valid estimate of the total or 
mean. The sub-samples are assigned to different interview- 
ers (or teams) which leads to a valid estimate of the total 
variance that takes proper account of the correlated response 
variance component due to interviewers. Interpenetrating 
sub-samples increase the travel costs of interviewers, but 
they can be reduced through modifications of interviewer 
assignments. Hansen, Hurwitz, Marks and Mauldin (1951), 
Sukhatme and Seth (1952) and Hansen, Hurwitz and 
Bershad (1961) developed basic theories under additive 
measurement error models, and decomposed the total vari- 
ance into sampling variance, simple response variance and 
correlated response variance. The correlated response variance 
due to interviewers was shown to be of the order $k^{-1}$ 
regardless of the sample size, where $k$ is the number of 
interviewers. As a result, it can dominate the total variance 
if $k$ is not large. The 1950 U.S. Census interviewer variance 
study showed that this component was indeed large for 
small areas. Partly for this reason, self-enumeration by mail 
was first introduced in the 1960 U.S. Census to reduce this 
component of the variance (Waksberg 1998). This is indeed 
a success story of theory influencing practice. Fellegi (1964) 
proposed a combination of interpenetration and replication 
to estimate the covariance between sampling and response 
deviations. This component is often neglected in the decom- 
position of total variance but it could be sizeable in practice. 

Yet another early milestone in sample survey methods is 
the concept of design effect (DEFF) due to Leslie Kish (see 
Kish 1965, section 8.2). The design effect is defined as the 
ratio of the actual variance of a statistic under the specified 
design to the variance that would be obtained under simple 
random sampling of the same size. This concept is espe- 
cially useful in the presentation and modeling of sampling 
errors, and also in the analysis of complex survey data 
involving clustering and unequal probabilities of selection 
(see section 6). 
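
As a rough numerical illustration of the design effect, the Python sketch below compares the estimated variance of a sample mean under one-stage cluster sampling with equal cluster sizes to the variance that simple random sampling of the same total size would give. Finite population corrections are ignored and the simple estimators used are assumptions made for the example; the function name and data are hypothetical.

```python
def design_effect(clusters):
    """Estimated DEFF of the sample mean for equal-sized sampled clusters."""
    k = len(clusters)          # number of sampled clusters
    m = len(clusters[0])       # units per cluster (assumed equal)
    n = k * m
    cluster_means = [sum(c) / m for c in clusters]
    grand_mean = sum(cluster_means) / k

    # Variance of the mean under the cluster design: between-cluster variation.
    var_cluster = sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1) / k

    # Variance of the mean if the same n units were a simple random sample.
    all_y = [v for c in clusters for v in c]
    s2 = sum((v - grand_mean) ** 2 for v in all_y) / (n - 1)
    var_srs = s2 / n

    return var_cluster / var_srs


# Homogeneous clusters: units within a cluster are identical.
deff_homogeneous = design_effect([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
# Heterogeneous clusters: every cluster mirrors the whole sample.
deff_heterogeneous = design_effect([[1, 2, 3], [2, 3, 1], [3, 1, 2]])
print(deff_homogeneous, deff_heterogeneous)
```

Strong within-cluster similarity pushes the design effect well above 1, whereas clusters that each resemble the whole population push it below 1.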

We refer the reader to Kish (1995), Kruskal and 
Mosteller (1980), Hansen, Dalenius and Tepping (1985) and 
O’Muircheartaigh and Wong (1981) for reviews of early 
contributions to sample survey theory and methods. 


3. Inferential Issues 
3.1 Unified Design-Based Framework 


The development of early sampling theory progressed 
more or less inductively, although Neyman (1934) studied 
best linear unbiased estimation for stratified random 
sampling. Strategies (design and estimation) that appeared 
reasonable were entertained and relative properties were 
carefully studied by analytical and/or empirical methods, 
mainly through comparisons of mean squared errors, and 
sometimes also by comparing anticipated mean squared 
errors or variances under plausible super-population models, 
as noted in section 2. Unbiased estimation under a given 
design was not insisted upon because it “often results in 
much larger mean squared error than necessary” (Hansen, 
Hurwitz and Tepping 1983). Instead, design consistency 
was deemed necessary for large samples, i.e., the estimator 
approaches the population value as the sample size increases. 
Classical text books by Cochran (1953), Deming (1950), 
Hansen, Hurwitz and Madow (1953), Sukhatme (1954) and 
Yates (1949), based on the above approach, greatly influ- 
enced survey practice. Yet, academic statisticians paid little 
attention to traditional sampling theory, possibly because it 
lacked a formal theoretical framework and was not 
integrated with mainstream statistical theory. Numerous 
prestigious statistics departments in North America did not 
offer graduate courses in sampling theory. 

Formal theoretical frameworks and approaches to integrating 
sampling theory with mainstream statistical inference 
were initiated in the 1950s under a somewhat idealistic 
set-up that focussed on sampling errors assuming the ab- 
sence of measurement or response errors and non-response. 
Horvitz and Thompson (1952) made a basic contribution to 
sampling with arbitrary probabilities of selection by formu- 
lating three subclasses of linear design-unbiased estimators 
of a total Y that include the Markov class studied by 
Neyman as one of the subclasses. Another subclass with 
design weight $d_i$ attached to a sample unit $i$ and depending 
only on $i$ admitted the well-known estimator with weight 
inversely proportional to the inclusion probability $\pi_i$ as the 
only unbiased estimator. Narain (1951) also discovered this 
estimator, so it should be called the Narain-Horvitz- 
Thompson (NHT) estimator rather than the HT estimator as 
it is commonly known. For simple random sampling, the 
sample mean is the best linear unbiased estimator (BLUE) 
of the population mean in the three subclasses, but this is not 
sufficient to claim that the sample mean is the best in the 
class of all possible linear unbiased estimators. Godambe 
(1955) proposed a general class of linear unbiased esti- 
mators of a total Y by recognizing the sample data as 
$\{(i, y_i), i \in s\}$ and by letting the weight depend on the 
sample unit $i$ as well as on the other units in the sample $s$, 
that is, the weight is of the form $d_i(s)$. He then established 
that the BLUE does not exist in the general class 


$$\hat{Y} = \sum_{i \in s} d_i(s)\, y_i \qquad (1)$$ 


even under simple random sampling. This important neg- 
ative theoretical result was largely overlooked for about 10 
years. Godambe also established a positive result by relating 
y to a size measure x using a super-population regression 
model through origin with error variance proportional to 
$x^2$, and then showing that the NHT estimator under any
fixed sample size design with $\pi_i$ proportional to $x_i$
minimizes the anticipated variance in the unbiased class (1). 
This result clearly shows the conditions on the design for the 
use of the NHT estimator. Rao (1966) recognized the lim- 
itations of the NHT estimator in the context of surveys with 
PPS sampling and multiple characteristics. Here the NHT 
estimator will be very inefficient when a characteristic y is 
unrelated or weakly related to the size measure x (such as 
poultry count y and farm size x in a farm survey). Rao 
proposed efficient alternative estimators for such cases that 
ignore the NHT weights. Ignoring the above results, some 
theoretical criteria were later advanced in the sampling 
literature to claim that the NHT estimator should be used for 
any sampling design. Using an amusing example of circus 
elephants, Basu (1971) illustrated the futility of such criteria. 
He constructed a “bad” design with $\pi_i$ unrelated to $y_i$, and
then demonstrated that the NHT estimator leads to absurd
estimates, which prompted the famous mainstream Bayesian
statistician Dennis Lindley to conclude that this counterexample
destroys the design-based sample survey theory
(Lindley 1996). This is rather unfortunate because NHT and 
Godambe clearly stated the conditions on the design for a 
proper use of the NHT estimator, and Rao (1966) and Hájek
(1971) proposed alternative estimators to deal with multiple 
characteristics and bad designs, respectively. It is interesting 
to note that the same theoretical criteria led to a bad variance 
estimator of the NHT estimator as the ‘optimal’ choice (Rao 
and Singh 1973). 
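
The design-unbiasedness of the NHT estimator under arbitrary
selection probabilities can be checked by brute force on a toy
population. The following Python sketch is only an illustration:
the population values and the sample probabilities p(s) are
hypothetical, and the function name nht_total is introduced here
and does not come from the literature. It enumerates all samples
of size 2, derives the inclusion probabilities from p(s), and
averages the estimator over the design.

```python
from itertools import combinations


def nht_total(sample, y, pi):
    """NHT estimate of the total: sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)


# Hypothetical population of N = 4 units (values chosen only for illustration).
y = [3.0, 7.0, 2.0, 8.0]
N, n = len(y), 2

# Hypothetical fixed-size design: all samples of size n, with arbitrary
# unequal selection probabilities p(s) that sum to one.
samples = list(combinations(range(N), n))
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p = [r / sum(raw) for r in raw]

# Inclusion probability of unit i: total probability of the samples containing i.
pi = [sum(ps for s, ps in zip(samples, p) if i in s) for i in range(N)]

# Design expectation: average the estimator over all possible samples.
expected = sum(ps * nht_total(s, y, pi) for s, ps in zip(samples, p))

print(f"true total = {sum(y):.4f}, design expectation of NHT = {expected:.4f}")
```

The expectation matches the true total for any choice of p(s)
with all inclusion probabilities positive, but the sketch says
nothing about the variance, which is where the choice of design
matters.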

Attempts were also made to integrate sample survey 
theory with mainstream statistical inference via the likelihood
function. Godambe (1966) showed that the likelihood
function from the sample data $\{(i, y_i), i \in s\}$, regarding
the N-vector of unknown y-values as the parameter,
provides no information on the unobserved sample
values and hence on the total Y. This uninformative feature
of the likelihood function is due to the label property that
treats the N population units as essentially N post-strata.
A way out of this difficulty is to take the Bayesian route by 
assuming informative (exchangeable) priors on the para- 
meter vector (Ericson 1969). An alternative route (design- 
based) is to ignore some aspects of the sample data to make 
the sample non-unique and thus arrive at an informative 
likelihood function (Hartley and Rao 1968; Royall 1968). 
For example, under simple random sampling, suppressing 
the labels i and regarding the data as $\{(i, y_i), i \in s\}$ in the
absence of information relating i to $y_i$, leads to the sample
mean as the maximum likelihood estimator of the popu- 
lation mean. Bayesian estimation, assuming non-informa- 
tive prior distributions, leads to results similar to Ericson’s 
(1969) but depends on the sampling design unlike Ericson’s. 
In the case $y_i$ is a vector that includes auxiliary variables
with known totals, Hartley and Rao (1968) showed that the 
maximum likelihood estimator under simple random sam- 
pling is approximately equal to the traditional regression 
estimator of the total. This paper was the first to show how 
to incorporate known auxiliary population totals in a like- 
lihood framework. For stratified random sampling, labels 
within strata are ignored but not strata labels because of 
known strata differences. The resulting maximum likelihood 
estimator is approximately equal to a pseudo-optimal linear 
regression estimator when auxiliary variables with known 
totals are available. The latter estimator has some good con- 
ditional design-based properties (see section 3.4). The focus 
of Hartley and Rao (1968) was on the estimation of a total, 
but the likelihood approach has much wider scope in sam- 
pling, including the estimation of distribution functions 
and quantiles and the construction of likelihood ratio 
based confidence intervals (see section 8.1). The Hartley- 
Rao non-parametric likelihood approach was discovered 
independently twenty years later (Owen 1988) in
mainstream statistical inference under the name “empirical
likelihood”. It has attracted a good deal of attention, 
including its application to various sampling problems. So 
in a sense the integration efforts with mainstream statistics 
were partially successful. Owen’s (2002) book presents a 
thorough account of empirical likelihood theory and its 
applications. 


3.2 Model-Dependent Approach 


The model-dependent approach to inference assumes that 
the population structure obeys a specified super-population 
model. The distribution induced by the assumed model 
provides inferences referring to the particular sample of 
units s that has been drawn. Such conditional inferences 
can be more relevant and appealing than repeated sampling 
inferences. But model-dependent strategies can perform 
poorly in large samples when the model is not correctly 
specified; even small deviations from the assumed model 
that are not easily detectable through model checking 
methods can cause serious problems. For example, consider 
the often-used ratio model when an auxiliary variable x 
with known total X is also measured in the sample: 


$$y_i = \beta x_i + \varepsilon_i, \quad i = 1, \ldots, N \qquad (2)$$


where the $\varepsilon_i$ are independent random variables with zero
mean and variance proportional to $x_i$. Assuming the model
holds for the sample, that is, no sample selection bias, the 
best linear model-unbiased predictor of the total Y is given 
by the ratio estimator $(\bar{y}/\bar{x})X$ regardless of the sample
design. This estimator is not design consistent unless the 
design is self-weighting, for example, stratified random 
sampling with proportional allocation. As a result, it can 
perform very poorly in large samples under non-self- 
weighting designs even if the deviations from the model are 
small. Hansen et al. (1983) demonstrated the poor performance
under a repeated sampling set-up, using a stratified
random sampling design with near optimal sample alloca- 
tion (commonly used to handle highly skewed populations). 
Rao (1996) used the same design to demonstrate poor 
performance under a conditional framework relevant to the 
model-dependent approach (Royall and Cumberland 1981). 
Nevertheless, model-dependent approaches can play a vital 
role in small area estimation where the sample size in a 
small area (or domain) can be very small or even zero; see 
section 7. 

Brewer (1963) proposed the model-dependent approach 
in the context of the ratio model (2). Royall (1970) and his 
collaborators made a systematic study of this approach. 
Valliant, Dorfman and Royall (2000) give a comprehensive 
account of the theory, including estimation of the (condi- 
tional) model variance of the estimator which varies with s. 
For example, under the ratio model (2) the model variance
depends on the sample mean $\bar{x}_s$. It is interesting to note
that balanced sampling through purposive selection appears 
in the model-dependent approach in the context of protec- 
tion against incorrect specification of the model (Royall and 
Herson 1973). 


3.3 Model-Assisted Approach 


The model-assisted approach attempts to combine the 
desirable features of design-based and model-dependent 
methods. It entertains only design-consistent estimators of 
the total Y that are also model unbiased under the assumed 
“working” model. For example, under the ratio model (2), a 
model-assisted estimator of Y for a specified probability 
sampling design is given by the ratio estimator
$\hat{Y}_r = (\hat{Y}_{NHT}/\hat{X}_{NHT})X$, which is design consistent regardless of
the assumed model. Hansen et al. (1983) used this estimator
for their stratified design to demonstrate its superior
performance over the model-dependent estimator $(\bar{y}/\bar{x})X$.
For variance estimation, the model-assisted approach uses 
estimators that are consistent for the design variance of the 
estimator and at the same time exactly or asymptotically 
model unbiased for the model variance. However, the infer- 
ences are design-based because the model is used only as a 
“working” model. 
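
To make the contrast between the two ratio estimators concrete,
the following Python sketch computes both from the same sample.
The data, the inclusion probabilities and the function names are
hypothetical and chosen only for illustration; the sketch shows
how the estimators are formed, not how they perform in repeated
sampling.

```python
def model_dependent_ratio(y, x, X):
    """(ybar / xbar) * X, using unweighted sample means (design is ignored)."""
    return (sum(y) / sum(x)) * X


def model_assisted_ratio(y, x, pi, X):
    """(Y_NHT / X_NHT) * X, using design weights 1 / pi_i."""
    y_nht = sum(yi / p for yi, p in zip(y, pi))
    x_nht = sum(xi / p for xi, p in zip(x, pi))
    return (y_nht / x_nht) * X


# Hypothetical sample: two small units drawn at rate 0.1 and two large
# units drawn at rate 0.5 (a non-self-weighting design).
y = [2.0, 4.0, 30.0, 50.0]
x = [1.0, 2.0, 10.0, 20.0]
pi = [0.1, 0.1, 0.5, 0.5]
X = 100.0  # known population total of x (hypothetical)

est_dependent = model_dependent_ratio(y, x, X)
est_assisted = model_assisted_ratio(y, x, pi, X)

# Under a self-weighting design (equal pi_i) the two estimators coincide.
equal_pi = [0.25] * len(y)
est_self_weighting = model_assisted_ratio(y, x, equal_pi, X)

print(est_dependent, est_assisted, est_self_weighting)
```

With unequal inclusion probabilities the two estimates differ,
and only the weighted version remains design consistent when the
ratio model fails.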

For the ratio estimator $\hat{Y}_r$, the variance estimator is given
by

$$\mathrm{var}(\hat{Y}_r) = (X/\hat{X}_{NHT})^2\, v(e), \qquad (3)$$


where in the operator notation v(e) is obtained from v(y)
by changing $y_i$ to the residuals $e_i = y_i - (\hat{Y}_{NHT}/\hat{X}_{NHT})x_i$.
This variance estimator is asymptotically equivalent
to a customary linearization variance estimator v(e),
but it reflects the fact that the information in the sample
varies with $\hat{X}_{NHT}$: larger values lead to smaller variability
and smaller values to larger variability. The resulting normal 
pivotal leads to valid model-dependent inferences under the 
assumed model (unlike the use of v(e) in the pivotal) and 
at the same time protects against model deviations in the 
sense of providing asymptotically valid design-based inferences.
Note that the pivotal is asymptotically equivalent to
$\hat{Y}(\tilde{e})/[v(\tilde{e})]^{1/2}$ with $\tilde{e}_i = y_i - (Y/X)x_i$. If the deviations
from the model are not large, then the skewness in the
residuals $\tilde{e}_i$ will be small even if $y_i$ and $x_i$ are highly
skewed, and normal confidence intervals will perform well.
On the other hand, for highly skewed populations, the
normal intervals based on $\hat{Y}_{NHT}$ and its standard error may
perform poorly under repeated sampling even for fairly
large samples because the pivotal depends on the skewness
of the $y_i$. Therefore, the population structure does matter in
design-based inferences contrary to the claims of Neyman
(1934), Hansen et al. (1983) and others. Rao, Jocelyn and
Hidiroglou (2003) considered the simple linear regression 
estimator under two-phase simple random sampling with
only x observed in the first phase. They demonstrated that 
the coverage performance of the associated normal intervals 
can be poor even for moderately large second phase samples 
if the true underlying model that generated the population 
deviated significantly from the linear regression model (for 
example, a quadratic regression of y on x) and the 
skewness of x is large. In this case, the first phase x-values
are observed, and a proper model-assisted approach
would use a multiple linear regression estimator with x and
$z = x^2$ as the auxiliary variables. Note that for single phase
sampling such a model-assisted estimator cannot be imple- 
mented if only the total X is known since the estimator 
depends on the population total of z. 

Särndal, Swensson and Wretman (1992) provide a comprehensive
account of the model-assisted approach to estimating
the total Y of a variable y under the working linear
regression model


$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, N \qquad (4)$$


with mean zero, uncorrelated errors $\varepsilon_i$ and model variance
$V_m(\varepsilon_i) = \sigma^2 q_i = \sigma_i^2$, where the $q_i$ are known constants
and the x-vectors have known totals X (the population
values $x_1, \ldots, x_N$ may not be known). Under this set-up,
the model-assisted approach leads to the generalized regression
(GREG) estimator with a closed-form expression

$$\hat{Y}_{gr} = \hat{Y}_{NHT} + \hat{B}'(X - \hat{X}_{NHT}) = \sum_{i \in s} w_i(s)\, y_i, \qquad (5)$$


where


$$\hat{B} = \hat{T}^{-1} \Big( \sum_{i \in s} \pi_i^{-1} x_i y_i / q_i \Big), \qquad (6)$$

with $\hat{T} = \sum_{i \in s} \pi_i^{-1} x_i x_i'/q_i$, is a weighted regression coefficient,
and $w_i(s) = g_i(s)\pi_i^{-1}$ with $g_i(s) = 1 + (X - \hat{X}_{NHT})'\hat{T}^{-1}x_i/q_i$,
known as “g-weights”. Note that the GREG
estimator (5) can also be written as $\sum_{i \in U} \hat{y}_i + \hat{E}_{NHT}$, where
$\hat{y}_i = x_i'\hat{B}$ is the predictor of $y_i$ under the working model
and $\hat{E}_{NHT}$ is the NHT estimator of the total prediction error
$E = \sum_{i \in U} e_i$ with $e_i = y_i - \hat{y}_i$. This representation shows
the role of the working model in the model-assisted 
approach. The GREG estimator (5) is design-consistent as 
well as model-unbiased under the working model (4). 
Moreover, it is nearly “optimal” in the sense of minimizing 
the asymptotic anticipated MSE (model expectation of the 
design MSE) under the working model, provided the 
inclusion probability, $\pi_i$, is proportional to the model
standard deviation $\sigma_i$. However, in surveys with multiple
variables of interest, the model variance may vary across 
variables. Because one must use a general-purpose design 
such as the design with inclusion probabilities proportional 
to sizes, the optimality result no longer holds, even if the 
same vector $x_i$ is used for all the variables $y_i$ in the
working model. 
The GREG estimator simplifies to the ‘projection’ estimator
$X'\hat{B} = \sum_{i \in s} w_i(s)\, y_i$ with $g_i(s) = X'\hat{T}^{-1}x_i/q_i$ if the
model variance $\sigma_i^2$ is proportional to $\lambda' x_i$ for some $\lambda$.
The ratio estimator is obtained as a special case of the projection
estimator by letting $q_i = x_i$, leading to $g_i(s) = X/\hat{X}_{NHT}$.
Note that the GREG estimator (5) requires only
the population totals X and not necessarily the individual
population values $x_i$. This is very useful because the
auxiliary population totals are often ascertained from external
sources such as demographic projections of age and sex
counts. Also, it ensures consistency with the known totals
X in the sense of $\sum_{i \in s} w_i(s)\, x_i = X$. Because of this property,
GREG is also a calibration estimator.
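
The g-weights and the calibration property can be illustrated
with a small Python sketch. It assumes a two-dimensional
auxiliary vector x_i = (1, size_i)' so that the 2 x 2 matrix T
can be inverted by hand; the data, the totals and the function
name greg_weights are hypothetical and serve only as an
illustration of the formulas above.

```python
def greg_weights(x, pi, X, q=None):
    """GREG weights w_i(s) = g_i(s) / pi_i for 2-dimensional x_i."""
    n = len(x)
    q = q if q is not None else [1.0] * n
    d = [1.0 / p for p in pi]  # design weights

    # T = sum_s d_i x_i x_i' / q_i (a symmetric 2 x 2 matrix).
    t11 = sum(d[i] * x[i][0] * x[i][0] / q[i] for i in range(n))
    t12 = sum(d[i] * x[i][0] * x[i][1] / q[i] for i in range(n))
    t22 = sum(d[i] * x[i][1] * x[i][1] / q[i] for i in range(n))
    det = t11 * t22 - t12 * t12
    inv = [[t22 / det, -t12 / det], [-t12 / det, t11 / det]]

    # NHT estimates of the auxiliary totals and their distance from X.
    x_nht = [sum(d[i] * x[i][k] for i in range(n)) for k in range(2)]
    diff = [X[0] - x_nht[0], X[1] - x_nht[1]]

    # a' = (X - X_NHT)' T^{-1}
    a = [diff[0] * inv[0][0] + diff[1] * inv[1][0],
         diff[0] * inv[0][1] + diff[1] * inv[1][1]]

    g = [1.0 + (a[0] * x[i][0] + a[1] * x[i][1]) / q[i] for i in range(n)]
    return [g[i] * d[i] for i in range(n)]


# Hypothetical sample with x_i = (1, size_i)', so that X = (N, total size).
x = [(1.0, 2.0), (1.0, 5.0), (1.0, 3.0), (1.0, 9.0)]
pi = [0.2, 0.2, 0.4, 0.4]
X = (15.0, 70.0)
y = [4.0, 11.0, 5.0, 20.0]

w = greg_weights(x, pi, X)
greg_estimate = sum(wi * yi for wi, yi in zip(w, y))

# Calibration check: the weighted sample totals of x reproduce X.
calibrated = [sum(wi * xi[k] for wi, xi in zip(w, x)) for k in range(2)]
print(w, greg_estimate, calibrated)
```

Because the weights reproduce X exactly, any y that is an exact
linear function of x_i is estimated without error, which is one
way to see the role of the working model.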

Suppose there are p variables of interest, say $y^{(1)}, \ldots, y^{(p)}$,
and we want to use the model-assisted approach to
estimate the corresponding population totals $Y^{(1)}, \ldots, Y^{(p)}$.
Also, suppose that the working model for $y^{(j)}$ is of the
form (4) but requires a possibly different x-vector $x^{(j)}$
with known total $X^{(j)}$ for each $j = 1, \ldots, p$:


$$y_i^{(j)} = x_i^{(j)\prime}\beta^{(j)} + \varepsilon_i^{(j)}, \quad i = 1, \ldots, N. \qquad (7)$$


In this case, the g-weights depend on j and in turn the
final weights $w_i(s)$ also depend on j. In practice, it is
often desirable to use a single set of final weights for all the
p variables to ensure internal consistency of figures when
aggregated over different variables. This property can be
achieved only by enlarging the x-vector in the model (7)
to accommodate all the variables $y^{(j)}$, say $\tilde{x}$ with known
total $\tilde{X}$, and then using the working model


$$y_i^{(j)} = \tilde{x}_i'\beta^{(j)} + \varepsilon_i^{(j)}, \quad i = 1, \ldots, N. \qquad (8)$$


However, the resulting weighted regression coefficients
could become unstable due to possible multicollinearity in
the enlarged set of auxiliary variables. As a result, the
GREG estimator of $Y^{(j)}$ under model (8) is less efficient
than the GREG estimator under model (7). Moreover,
some of the resulting final weights, say $w_i(s)$, may
not satisfy range restrictions by taking either values smaller
than 1 (including negative values) or very large positive
values. A possible solution to handle this problem is to use a
generalized ridge regression estimator of $Y^{(j)}$ that is
model-assisted under the enlarged model (Chambers 1996;
Rao and Singh 1997).

For variance estimation, the model-assisted approach
attempts to use design-consistent variance estimators that
are also model-unbiased (at least for large samples) for the
conditional model variance of the GREG estimator. Denoting
the variance estimator of the NHT estimator of Y by
v(y) in an operator notation, a simple Taylor linearization
variance estimator satisfying the above property is given by
v(ge), where v(ge) is obtained by changing $y_i$ to
$g_i(s)e_i$ in the formula for v(y); see Hidiroglou, Fuller
and Hickman (1976) and Särndal, Swensson and Wretman
(1989).

In the above discussion, we have assumed a working linear
regression model for all the variables $y^{(j)}$. But in practice
a linear regression model may not provide a good fit for
some of the y-variables of interest, for example, a binary
variable. In the latter case, logistic regression provides a
suitable working model. A general working model that covers
logistic regression is of the form $E_m(y_i) = h(x_i'\beta) = \mu_i$,
where $h(\cdot)$ could be non-linear; model (4) is a special
case with h(a) = a. A model-assisted estimator of the total
under the general working model is the difference estimator
$\hat{Y}_{NHT} + \sum_{i \in U} \hat{\mu}_i - \sum_{i \in s} \pi_i^{-1}\hat{\mu}_i$, where $\hat{\mu}_i = h(x_i'\hat{\beta})$ and $\hat{\beta}$ is
an estimator of the model parameter $\beta$. It reduces to the
GREG estimator (5) if h(a) = a. This difference estimator
is nearly optimal if the inclusion probability $\pi_i$ is proportional
to $\sigma_i$, where $\sigma_i^2$ denotes the model variance,
$V_m(y_i)$.

GREG estimators have become popular among users 
because many of the commonly used estimators may be 
obtained as special cases of (5) by suitable specifications of 
$x_i$ and $q_i$. A Generalized Estimation System (GES) based
on GREG has been developed at Statistics Canada. 

Kott (2005) has proposed an alternative inferential paradigm,
called the randomization-assisted model-based approach,
which attempts to focus on model-based inference
assisted by randomization (or repeated sampling). The def- 
inition of anticipated variance is reversed to the ran- 
domization-expected model variance of an estimator, but it 
is identical to the customary anticipated variance when the 
working model holds for the sample, as assumed in the 
paper. As a result, the choices of estimator and variance 
estimator are often similar to those under the model-assisted 
approach. However, Kott argues that the motivation is 
clearer and “the approach proposed here for variance 
estimation leads to logically coherent treatment of finite 
population and small-sample adjustments when needed”. 


3.4 Conditional Design-Based Approach 


A conditional design-based approach has also been 
proposed. This approach attempts to combine the condi- 
tional features of the model-dependent approach with the 
model-free features of the design-based approach. It allows 
us to restrict the reference set of samples to a “relevant” 
subset of all possible samples specified by the design. 
Conditionally valid inferences are obtained in the sense that 
the conditional bias ratio (i.e., the ratio of conditional bias to 
conditional standard error) goes to zero as the sample size 
increases. Approximately $100(1-\alpha)\%$ of the realized confidence
intervals in repeated sampling from the conditional
set will contain the unknown total Y. 
Holt and Smith (1979) provide compelling arguments in
favour of conditional design-based inference, even though
the discussion was confined to one-way post-stratification of 
a simple random sample in which case it is natural to make 
inferences conditional on the realized strata sample sizes. 
Rao (1992, 1994) and Casady and Valliant (1993) studied 
conditional inference when only the auxiliary total X is 
known from external sources. In the latter case, conditioning 
on the NHT estimator $\hat{X}_{NHT}$ may be reasonable because it
is “approximately” an ancillary statistic when X is known
and the difference $\hat{X}_{NHT} - X$ provides a measure of imbalance
in the realized sample. Conditioning on $\hat{X}_{NHT}$ leads to
the “optimal” linear regression estimator which has the
same form as the GREG estimator (5) with $\hat{B}$ given by (6)
replaced by the estimated optimal value $\hat{B}_{opt}$ of the regression
coefficient which involves the estimated covariance of
$\hat{Y}_{NHT}$ and $\hat{X}_{NHT}$ and the estimated variance of $\hat{X}_{NHT}$. This
optimal estimator leads to conditionally valid design-based
inferences and is model-unbiased under the working model
(4). It is also a calibration estimator depending only on the
total X and it can be expressed as $\sum_{i \in s} \tilde{w}_i(s)\, y_i$ with
weights $\tilde{w}_i(s) = d_i \tilde{g}_i(s)$ and the calibration factor $\tilde{g}_i(s)$
depending only on the total X and the sample x-values.
It works well for stratified random sampling (commonly
used in establishment surveys). However, $\hat{B}_{opt}$ can become
unstable in the case of stratified multistage sampling unless 
the number of sample clusters minus the number of strata is 
fairly large. The GREG estimator does not require the latter 
condition but it can perform poorly in terms of conditional 
bias ratio and conditional coverage rates, as shown by Rao 
(1996). The unbiased NHT estimator can be very bad condi- 
tionally unless the design ensures that the measure of imbal- 
ance as defined above is small. For example, in the Hansen 
et al. (1983) design based on efficient x-stratification, the
imbalance is small and the NHT estimator indeed performed 
well conditionally. 

Tillé (1998) proposed an NHT estimator of the total Y 
based on approximate conditional inclusion probabilities 
given $\hat{X}_{NHT}$. His method also leads to conditionally valid
inferences, but the estimator is not calibrated to X unlike 
the “optimal” linear regression estimator. Park and Fuller 
(2005) proposed a calibrated GREG version based on 
Tillé’s estimator which leads to non-negative weights more 
often than GREG. 

I believe practitioners should pay more attention to 
conditional aspects of design-based inference and seriously 
consider the new methods that have been proposed. 

Kalton (2002) has given compelling arguments for fa- 
voring design-based approaches (possibly model-assisted 
and/or conditional) for inference on finite population de- 
scriptive parameters. Smith (1994) named design-based 
inference as “procedural inference” and argued that 
procedural inference is the correct approach for surveys in
the public domain. We refer the reader to Smith (1976) and 
Rao and Bellhouse (1990) for reviews of inferential issues 
in sample survey theory. 


4. Calibration Estimators 


Calibration weights $w_i(s)$ that ensure consistency with
user-specified auxiliary totals X are obtained by adjusting
the design weights $d_i = \pi_i^{-1}$ to satisfy the benchmark constraints
$\sum_{i \in s} w_i(s)\, x_i = X$. Estimators that use calibration
weights are called calibration estimators and they use a
single set of weights $\{w_i(s)\}$ for all the variables of
interest. We have noted in section 3.3 that the model-assisted
GREG estimator is a calibration estimator, but a
calibration estimator may not be model-assisted in the sense
that it could be model-biased under a working model (4)
unless the x-variables in the model exactly match the
variables corresponding to the user-specified totals. For
example, suppose the working model suggested by the data
is a quadratic in a scalar variable x while the user-specified
total is only its total X. The resulting calibration estimator
can perform poorly even in fairly large samples, as noted in
section 3.3, unlike the model-assisted GREG estimator
based on the working quadratic model that requires the
population total of the quadratic variable $x^2$ in addition to
X.

Post-stratification has been extensively used in practice 
to ensure consistency with known cell counts corresponding 
to a post-stratification variable, for example counts in dif- 
ferent age groups ascertained from external sources such as 
demographic projections. The resulting post-stratified esti- 
mator is a calibration estimator. Calibration estimators that 
ensure consistency with known marginal counts of two or 
more post-stratification variables have also been employed 
in practice; in particular raking ratio estimators that are 
obtained by benchmarking to the marginal counts in turn 
until convergence is approximately achieved, typically in 
four or fewer iterations. Raking ratio weights $w_i(s)$ are
always positive. In the past, Statistics Canada used raking
ratio estimators in the Canadian Census to ensure consistency
of 2B-item estimators with known 2A-item counts.
In the context of the Canadian Census, Brackstone and Rao
(1979) studied the efficiency of raking ratio estimators and
also derived Taylor linearization variance estimators when
the number of iterations is four or fewer. Raking ratio
estimators have also been employed in the U.S. Current 
Population Survey (CPS). It may be noted that the method 
of adjusting cell counts to given marginal counts in a two- 
way table was originally proposed in the landmark paper by 
Deming and Stephan (1940). 
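
The raking procedure described above, benchmarking to each set
of marginal counts in turn until the margins agree, can be
sketched in a few lines of Python. The design weights, the
classification of units and the marginal counts below are
hypothetical, and the function name rake and its stopping rule
are choices made for this illustration only.

```python
def rake(d, rows, cols, row_totals, col_totals, max_iter=100, tol=1e-10):
    """Raking ratio weights: scale to row margins, then column margins, repeat."""
    w = list(d)
    units = range(len(w))
    for _ in range(max_iter):
        for r, target in row_totals.items():
            current = sum(w[i] for i in units if rows[i] == r)
            for i in units:
                if rows[i] == r:
                    w[i] *= target / current
        for c, target in col_totals.items():
            current = sum(w[i] for i in units if cols[i] == c)
            for i in units:
                if cols[i] == c:
                    w[i] *= target / current
        # Column margins now hold exactly; stop once row margins also hold.
        gap = max(abs(sum(w[i] for i in units if rows[i] == r) - t)
                  for r, t in row_totals.items())
        if gap < tol:
            break
    return w


# Hypothetical sample of six units with equal design weights.
d = [10.0] * 6
rows = ["young", "young", "young", "old", "old", "old"]  # age group
cols = ["f", "m", "f", "m", "f", "m"]                     # sex
row_totals = {"young": 40.0, "old": 60.0}
col_totals = {"f": 55.0, "m": 45.0}

w = rake(d, rows, cols, row_totals, col_totals)
print([round(wi, 3) for wi in w])
```

Each step multiplies the current weights by a positive factor,
which is why raking weights stay positive when the design
weights are positive.
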

Unified approaches to calibration, based on minimizing a
suitable distance measure between calibration weights and
design weights subject to benchmark constraints, have
attracted the attention of users due to their ability to accommodate
an arbitrary number of user-specified benchmark constraints,
for example, calibration to the marginal counts of
several post-stratification variables. Calibration software is
also readily available, including GES (Statistics Canada),
LINWEIGHT (Statistics Netherlands), CALMAR (INSEE,
France) and CLAN97 (Statistics Sweden).

A chi-squared distance, $\sum_{i \in s} q_i (d_i - w_i)^2 / d_i$, leads to
the GREG estimator (5), where the x-vector corresponds to
the user-specified benchmark constraints (BC) and $w_i(s)$
is denoted as $w_i$ for simplicity (Huang and Fuller 1978;
Deville and Särndal 1992). However, the resulting calibration
weights may not satisfy desirable range restrictions
(RR), for example some weights may be negative or too 
large especially when the number of constraints is large and 
the variability of the design weights is large. Huang and 
Fuller (1978) proposed a scaled modified chi-squared 
distance measure and obtained the calibration weights 
through an iterative solution that satisfies BC at each 
iteration. However, a solution that satisfies BC and RR may 
not exist. Another method, called shrinkage minimization 
(Singh and Mohl 1996) has the same difficulty. Quadratic 
programming methods that minimize the chi-squared 
distance subject to both BC and RR have also been pro- 
posed (Hussain 1969) but the feasible set of solutions satis- 
fying both BC and RR can be empty. Alternative methods 
propose to change the distance function (Deville and 
Särndal 1992) or drop some of the BC (Bankier, Rathwell
and Majkowski 1992). For example, an information distance
of the form $\sum_{i \in s} q_i \{w_i \log(w_i/d_i) - w_i + d_i\}$ gives
raking ratio estimators with non-negative weights $w_i$, but
some of the weights can be excessively large. “Ridge”
weights obtained by minimizing a penalized chi-squared
distance have also been proposed (Chambers 1996), but there is no
guarantee that either BC or RR is satisfied, although the
weights are more stable than the GREG weights. Rao and 
Singh (1997) proposed a “ridge shrinkage” iterative method 
that ensures convergence for a specified number of 
iterations by using a built-in tolerance specification to relax 
some BC while satisfying RR. Chen, Sitter and Wu (2002) 
proposed a similar method. 

GREG calibration weights have been used in the
Canadian Labour Force Survey and more recently the method has
been extended to accommodate composite estimators that
make use of sample information in previous months, as 
noted in section 2 (Fuller and Rao 2001; Gambino, Kennedy 
and Singh 2001; Singh, Kennedy and Wu 2001). GREG- 
type calibration estimators have also been used for the 
integration of two or more independent surveys from the 
same population. Such estimators ensure consistency between
the surveys, in the sense that the estimates from the
two surveys for common variables are identical, as well as 
benchmarking to known population totals (Renssen and 
Nieuwenbroek 1997; Singh and Wu 1996; Merkouris 2004). 
For the 2001 Canadian Census, Bankier (2003) studied calibration
weights corresponding to the “optimal” linear regression
estimator (section 3.4) under stratified random
sampling. He showed that the “optimal” calibration method 
performed better than the GREG calibration used in the 
previous census, in the sense of allowing more BC to be 
retained while at the same time allowing the calibration 
weights to be at least one. The “optimal” calibration weights 
can be obtained from GES software by including the known 
strata sizes in the BC and defining the tuning constant $q_i$
suitably. Note that the “optimal” calibration estimator also 
has desirable conditional design properties (section 3.4). 
Weighting for the 2001 Canadian census switched from 
projection GREG (used in the 1996 census) to “optimal” 
linear regression. 

Demnati and Rao (2004) derived Taylor linearization
variance estimators for a general class of calibration estimators
with weights $w_i = d_i F(x_i'\hat{\lambda})$, where the Lagrange
multiplier $\hat{\lambda}$ is determined by solving the calibration
constraints. The choice F(a) = 1 + a gives GREG weights
and $F(a) = e^a$ leads to raking ratio weights. In the special
case of GREG weights, the variance estimator reduces to 
v(ge) given in section 3.3. 
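
As an illustration of the general class of weights of the form
d_i F(x_i' lambda), the following Python sketch solves the
calibration constraint for a scalar auxiliary variable by
Newton's method. The solver, the function name calibrate, the
tolerance and the data are assumptions made for this sketch and
are not taken from Demnati and Rao (2004); the benchmark total X
is deliberately set far from the NHT estimate to show how the two
choices of F behave.

```python
import math


def calibrate(d, x, X, F, dF, tol=1e-10, max_iter=100):
    """Weights d_i * F(x_i * lam), with lam solving sum_i w_i x_i = X (scalar x)."""
    lam = 0.0
    for _ in range(max_iter):
        gap = sum(di * F(xi * lam) * xi for di, xi in zip(d, x)) - X
        if abs(gap) < tol:
            break
        slope = sum(di * dF(xi * lam) * xi * xi for di, xi in zip(d, x))
        lam -= gap / slope  # Newton step on the calibration equation
    return [di * F(xi * lam) for di, xi in zip(d, x)]


# Hypothetical design weights, auxiliary values and benchmark total.
d = [5.0, 5.0, 2.5, 2.5]
x = [2.0, 5.0, 3.0, 9.0]
X = 20.0  # far below the NHT estimate sum(d_i x_i) = 65

# F(a) = 1 + a: linear (GREG-type) weights with q_i = 1.
w_linear = calibrate(d, x, X, lambda a: 1.0 + a, lambda a: 1.0)

# F(a) = exp(a): raking-type weights, positive by construction.
w_exp = calibrate(d, x, X, math.exp, math.exp)

print([round(v, 4) for v in w_linear])
print([round(v, 4) for v in w_exp])
```

In this deliberately extreme example both sets of weights meet
the benchmark constraint, but the linear choice produces a
negative weight while the exponential choice cannot.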

We refer the reader to the Waksberg award paper of 
Fuller (Fuller 2002) for an excellent overview and appraisal 
of regression estimation in survey sampling, including 
calibration estimation. 


5. Unequal Probability Sampling 
Without Replacement 


We have noted in section 2 that PPS sampling of PSUs 
within strata in large-scale surveys was practically moti- 
vated by the desire to achieve approximately equal work- 
loads. PPS sampling also achieves significant variance re- 
duction by controlling on the variability arising from un- 
equal PSU sizes without actually stratifying by size. PSUs 
are typically sampled without replacement such that the 
PSU inclusion probability, $\pi_i$, is proportional to PSU size
measure $x_i$. For example, systematic PPS sampling, with
or without initial randomization of the PSU labels, is an
inclusion probability proportional to size (IPPS) design (also
called $\pi$PS design) that has been used in many complex
surveys, including the Canadian LFS. The estimator of a 
total associated with an IPPS design is the NHT estimator. 

Development of suitable (IPPS, NHT) strategies raises 
theoretically challenging problems, including the evaluation 
of exact joint inclusion probabilities, $\pi_{ij}$, or accurate
approximations to $\pi_{ij}$ requiring only the individual $\pi_i$s,
that are needed in getting unbiased or nearly unbiased
variance estimators. My own 1961 Ph.D. thesis at Iowa State
University addressed the latter problem. Several solutions, 
requiring sophisticated theoretical tools, have been 
published since then by talented mathematical statisticians. 
However, this theoretical work is often classified as “theory 
without application” because it is customary practice to treat 
the PSUs as if sampled with replacement since that leads to 
great simplification. The variance estimator is simply 
obtained from the estimated PSU totals and, in fact, this 
assumption is the basis for re-sampling methods (section 6). 
This variance estimator can lead to substantial over-esti- 
mation unless the overall PSU sampling fraction is small. 
The latter may be true in many large-scale surveys. In the 
following paragraphs, I will try to demonstrate that the 
theoretical work on (IPPS, NHT) strategies as well as some 
non-IPPS designs has wide practical applicability.

First, I will focus on (IPPS, NHT) strategies. In Sweden 
and some other countries in Europe, stratified single-stage 
sampling is often used because of the availability of list 
frames and IPPS designs are attractive options, but sampling 
fractions are often large. For example, Rosén (1991) notes 
that Statistics Sweden’s Labour Force Barometer samples 
some 100 different populations using systematic PPS 
sampling and that the sampling rates can exceed 50%. Aires 
and Rosén (2005) studied Pareto $\pi$PS sampling for Swedish
surveys. This method has attractive properties, including
fixed sample size, simple sample selection, good estimation
precision and consistent variance estimation regardless of
sampling rates. It also allows sample coordination through
permanent random numbers (PRN) as in Poisson sampling,
but the latter method leads to variable sample size. Because
of these merits, Pareto $\pi$PS has been implemented in a
number of Statistics Sweden surveys, notably in price index 
surveys. Ohlsson (1995) described PRN techniques that are 
commonly used in practice. 

The method of Rao-Sampford (see Brewer and Hanif 
1983, page 28) leads to exact IPPS designs and non- 
negative unbiased variance estimators for arbitrary fixed 
sample sizes. It has been implemented in the new version of 
SAS. Stehman and Overton (1994) note that variable proba- 
bility structure arises naturally in environmental surveys 
rather than being selected just for enhanced efficiency, and 
that the $\pi_i$'s are only known for the units $i$ in the sample
$s$. By treating the sample design as randomized systematic
PPS, Stehman and Overton obtained approximations to the
$\pi_{ij}$'s that depend only on $\pi_i$, $i \in s$, unlike the original
approximations of Hartley and Rao (1962) that require the
sum of squares of all the $\pi_i$'s in the population. In the
Stehman and Overton applications, the sampling rates are
substantial enough to warrant the evaluation of the joint
inclusion probabilities. 

I will now turn to non-IPPS designs using estimators 
different from the NHT estimator that ensure zero variance 
when y is exactly proportional to x. The random group 
method of Rao, Hartley and Cochran (1962) permits a 
simple non-negative variance estimator for any fixed sample 
size and yet compares favorably to (IPPS, NHT) strategies 
in terms of efficiency and is always more efficient than the 
PPS with replacement strategy. Schabenberger and Gregoire 
(1994) noted that (IPPS, NHT) strategies have not enjoyed 
much application in forestry because of difficulty in im- 
plementation and recommended the Rao-Hartley-Cochran 
strategy in view of its remarkable simplicity and good 
efficiency properties. It is interesting to note that this 
strategy has been used in the Canadian LFS on the basis of 
its suitability for switching to new size measures, using the 
Keyfitz method within each random group. On the other 
hand, (IPPS, NHT) strategies are not readily suitable for this 
purpose (Fellegi 1966). I understand that the Rao-Hartley- 
Cochran strategy is often used in audit sampling and other 
accounting applications. 
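
The simplicity of the random group method can be seen in a short sketch. The Python code below is illustrative only: the function names, the way units are dealt into groups, and the variance formula (the standard textbook form of the Rao-Hartley-Cochran variance estimator, which is not written out in this paper) are assumptions made for the example. The size measures `p` are assumed to be normalized to sum to 1.

```python
import random

def rhc_sample(p, n, rng):
    """Split the unit indices into n random groups and draw one unit per
    group with probability proportional to p within the group."""
    units = list(range(len(p)))
    rng.shuffle(units)
    # Deal the shuffled units into n groups of nearly equal size.
    groups = [units[g::n] for g in range(n)]
    draws = []
    for group in groups:
        P_g = sum(p[i] for i in group)          # group size total
        chosen = rng.choices(group, weights=[p[i] for i in group], k=1)[0]
        draws.append((chosen, P_g, len(group)))
    return draws

def rhc_estimate(y, p, draws):
    """Return the estimated total and its variance estimate.
    Only the y-values of the drawn units are used."""
    N = len(p)
    total = sum(y[i] * P_g / p[i] for i, P_g, _ in draws)
    sum_sq = sum(N_g ** 2 for _, _, N_g in draws)
    factor = (sum_sq - N) / (N ** 2 - sum_sq)
    var = factor * sum(P_g * (y[i] / p[i] - total) ** 2 for i, P_g, _ in draws)
    return total, var
```

When y is exactly proportional to the size measure, every ratio y_i/p_i equals the population total, so both the estimation error and the variance estimate vanish, whatever groups are formed.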

Murthy (1957) used a non-IPPS design based on drawing
successive units with probabilities $p_i$, $p_j/(1 - p_i)$,
$p_k/(1 - p_i - p_j)$ and so on, and the following estimator:


$$\hat{Y}_M = \sum_{i \in s} y_i \frac{p(s|i)}{p(s)} \qquad (9)$$


where $p(s|i)$ is the conditional probability of obtaining
the sample $s$ given that unit $i$ was selected first. He also
provided a non-negative variance estimator requiring the
conditional probabilities, $p(s|i, j)$, of obtaining $s$ given
$i$ and $j$ are selected in the first two draws. This method did
not receive practical attention for several years due to 
computational complexity, but more recently it has been 
applied in unexpected areas, including oil discovery 
(Andreatta and Kaufman 1986) and sequential sampling
including inverse sampling and some adaptive sampling 
schemes (Salehi and Seber 1997). It may be noted that 
adaptive sampling has received a lot of attention in recent 
years because of its potential as an efficient sampling 
method for estimating totals or means of rare populations 
(Thompson and Seber 1996). In the oil discovery appli- 
cation, the successive sampling scheme is a characterization 
of discovery and the order in which fields are discovered is 
governed by sampling proportional to field size and without 
replacement, following the industry folklore “on the
average, the big fields are found first”. Here $p_i = y_i/Y$
and the total oil reserve $Y$ is assumed to be known from
geological considerations. In this application, geologists are
interested in the size distribution of all fields in the basin and
when a basin is partially explored the sample is composed
of magnitudes $y_i$ of discovered deposits. The size distribution
function $F(a)$ can be estimated by using Murthy’s
estimator (9) with $y_i$ replaced by the indicator variable
$I(y_i \le a)$. The computation of $p(s|i)$ and $p(s)$, however,
is formidable even for moderate sample sizes. To overcome
this computational difficulty, Andreatta and Kaufman
(1986) used integral representations of these quantities to
develop asymptotic expansions of Murthy’s estimator, the
first few terms of which are easily computable. Similarly,
they obtained computable approximations to Murthy’s variance
estimator. Note that the NHT estimator of $F(a)$ is not
feasible here because the inclusion probabilities are functions
of all the $y$-values in the population.
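
To make the computational burden concrete, the following Python sketch evaluates Murthy’s estimator by brute force, summing the successive-draw probabilities over every ordering of the sample. It is an illustration under stated assumptions, not the method of Andreatta and Kaufman: the function names are hypothetical, and the enumeration over all n! orderings is feasible only for very small samples, which is exactly the difficulty noted above.

```python
from itertools import permutations

def ordered_draw_prob(order, p):
    """Probability of drawing the units in `order` successively without
    replacement, with probabilities p_i, p_j/(1-p_i), p_k/(1-p_i-p_j), ..."""
    prob, remaining = 1.0, 1.0
    for i in order:
        prob *= p[i] / remaining
        remaining -= p[i]
    return prob

def murthy_estimate(sample, y, p):
    """Murthy's estimator: sum over i in s of y_i * p(s|i) / p(s)."""
    p_s = 0.0
    first = {i: 0.0 for i in sample}   # accumulates p_i * p(s|i)
    for order in permutations(sample):
        prob = ordered_draw_prob(order, p)
        p_s += prob                    # p(s): sum over all orderings of s
        first[order[0]] += prob        # orderings that start with this unit
    # p(s|i) is recovered as first[i] / p[i]
    return sum(y[i] * (first[i] / p[i]) / p_s for i in sample)
```

For a sample of two units i and j, this reduces to the closed form [(1 - p_j) y_i/p_i + (1 - p_i) y_j/p_j] / (2 - p_i - p_j), which gives a convenient check on the enumeration.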

The above discussion is intended to demonstrate that a 
particular theory can have applications in diverse practical 
areas even if it is not needed in a particular situation, such as 
large-scale surveys with negligible first stage sampling fractions.
Also, it shows that unequal probability sampling designs
play a vital role in survey sampling, despite Särndal’s
(1996) contention that simpler designs, such as stratified 
SRS and stratified Bernoulli sampling, together with GREG 
estimators should replace strategies based on unequal proba- 
bility sampling without replacement. 


6. Analysis of Survey Data 
and Resampling Methods 


Standard methods of data analysis are generally based on 
the assumption of simple random sampling, although some 
software packages do take account of survey weights and 
provide correct point estimates. However, application of 
standard methods to survey data, ignoring the design effect 
due to clustering and unequal probabilities of selection, can 
lead to erroneous inferences even for large samples. In 
particular, standard errors of parameter estimates and 
associated confidence intervals can be seriously under- 
stated, type I error rates of tests of hypotheses can be much 
bigger than the nominal levels, and standard model 
diagnostics, such as residual analysis to detect model 
deviations, are also affected. Kish and Frankel (1974) and 
others drew attention to some of those problems and empha- 
sized the need for new methods that take proper account of 
the complexity of data derived from large-scale surveys. 
Fuller (1975) developed asymptotically valid methods for 
linear regression analysis, based on Taylor linearization 
variance estimators. Rapid progress has been made over the 
past 20 years or so in developing suitable methods. Re- 
sampling methods play a vital role in developing methods 
that take account of survey design in the analysis of data. 
All one needs is a data file containing the observed data, the 
final survey weights and the corresponding final weights for 
each pseudo-replicate generated by the re-sampling method. 
Software packages that take account of survey weights in
the point estimation of parameters of interest can then be
used to calculate the correct estimators and standard errors, 
as demonstrated below. As a result, re-sampling methods of 
inference have attracted the attention of users as they can 
perform the analyses themselves very easily using standard 
software packages. However, releasing public-use data files 
with replicate weights can lead to confidentiality issues, 
such as the identification of clusters from replicate weights. 
In fact, at present a challenge to theory is to develop suitable 
methods that can preserve confidentiality of the data. Lu, 
Brick and Sitter (2004) proposed grouping strata and then 
forming pseudo-replicates using the combined strata for 
variance estimation, thus limiting the risk of cluster identifi- 
cation from the resulting public-use data file. Grouping 
strata and/or PSUs within strata simplifies variance esti- 
mation by reducing the number of pseudo-replicates used in 
variance estimation compared to the commonly used delete- 
cluster jackknife discussed below. A method of inverse 
sampling to undo the complex survey data structure and yet 
provide protection against revealing cluster labels (Hinkins, 
Oh and Scheuren 1997; Rao, Scott and Benhin 2003) 
appears promising, but much work on inverse sampling 
methods remains to be done before it becomes attractive to 
the user. 

Rao and Scott (1981, 1984) made a systematic study of 
the impact of survey design effect on standard chi-squared 
and likelihood ratio tests associated with a multi-way table 
of estimated counts or proportions. They showed that the 
test statistic is asymptotically distributed as a weighted sum 
of independent $\chi^2_1$ variables, where the weights are the
eigenvalues of a “generalized design effects” matrix. This 
general result shows that the survey design can have a 
substantial impact on the type I error rate. Rao and Scott 
proposed simple first-order corrections to the standard chi- 
squared statistics that can be computed from published 
tables that include estimates of design effects for cell esti- 
mates and their marginal totals, thus facilitating secondary 
analyses from published tables. They also derived second 
order corrections that are more accurate, but require the 
knowledge of a full estimated covariance matrix of the cell 
estimates, as in the case of familiar Wald tests. However, 
Wald tests can become highly unstable as the number of 
cells in a multi-way table increases and the number of
sample clusters decreases, leading to unacceptably high type 
I error rates compared to the nominal levels, unlike the Rao- 
Scott second order corrections (Thomas and Rao 1987). The 
first and second order corrections are now known as 
Rao-Scott corrections and are given as default options in the 
new version of SAS. Roberts, Rao and Kumar (1987) 
developed Rao-Scott type corrections to tests for logistic 
regression analysis of estimated cell proportions associated 
with a binary response variable. They applied the methods
to a two-way table of employment rates from the Canadian
LFS 1977 obtained by cross-classifying age and education
groups. Bellhouse and Rao (2002) extended the work of
Roberts et al. to the analysis of domain means using
generalized linear models. They applied the methods to
domain means from a Fiji Fertility Survey cross-classified
by education and years since the woman’s first marriage,
where a domain mean is the mean number of children ever
born for women of Indian race belonging to the domain.

Re-sampling methods in the context of large-scale surveys
using stratified multi-stage designs have been studied
extensively. For inference purposes, the sample PSUs are 
treated as if drawn with replacement within strata. This 
leads to over-estimation of variances but it is small if the 
overall PSU sampling fraction is negligible. Let $\hat{\theta}$ be the
survey-weighted estimator of a “census” parameter of interest
computed from the final weights $w_i$, and let the corresponding
weights for each pseudo-replicate $r$ generated by
the re-sampling method be denoted by $w_i^{(r)}$. The estimator
based on the pseudo-replicate weights $w_i^{(r)}$ is denoted as
$\hat{\theta}^{(r)}$ for each $r = 1, \ldots, R$. Then a re-sampling variance
estimator of $\hat{\theta}$ is of the form


$$v(\hat{\theta}) = \sum_{r=1}^{R} c_r (\hat{\theta}^{(r)} - \hat{\theta})(\hat{\theta}^{(r)} - \hat{\theta})' \qquad (10)$$


for specified coefficients $c_r$ in (10) determined by the
re-sampling method.
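
Once the replicate estimates are available, the variance estimator (10) is a one-line computation. The following Python sketch covers the scalar case only; the function name and argument layout are assumptions made for illustration.

```python
def replicate_variance(theta_hat, theta_reps, coeffs):
    """Scalar form of (10): sum over r of c_r * (theta_r - theta_hat)^2."""
    if len(theta_reps) != len(coeffs):
        raise ValueError("need one coefficient per pseudo-replicate")
    return sum(c * (t - theta_hat) ** 2 for t, c in zip(theta_reps, coeffs))
```

For example, a full-sample estimate of 10 with replicate estimates 9 and 12 and coefficients 1/2 each gives 0.5(1) + 0.5(4) = 2.5.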

Commonly used re-sampling methods include (a) delete-cluster
(delete-PSU) jackknife, (b) balanced repeated replication
(BRR), particularly for $n_h = 2$ PSUs in each stratum
$h$, and (c) the Rao and Wu (1988) bootstrap. Jackknife
pseudo-replicates are obtained by deleting each sample
cluster $r = (hj)$ in turn, leading to jackknife design weights
$d_i^{(r)}$ taking the value 0 if the sample unit $i$ is in the
deleted cluster, $n_h d_i/(n_h - 1)$ if $i$ is not in the deleted
cluster but in the same stratum, and unchanged if $i$ is in a
different stratum. The jackknife design weights are then
adjusted for unit non-response and post-stratification,
leading to the final jackknife weights $w_i^{(r)}$. The jackknife
variance estimator is given by (10) with $c_r = (n_h - 1)/n_h$
for $r = (hj)$. The delete-cluster jackknife method has two
possible disadvantages: (1) When the total number of sampled
PSUs, $n = \sum_h n_h$, is very large, $R$ is also very large
because $R = n$. (2) It is not known if the delete-cluster jackknife
variance estimator is design-consistent in the case of non-smooth
estimators $\hat{\theta}$, for example the survey-weighted
estimator of the median. For simple random sampling, the
jackknife is known to be inconsistent for the median or
other quantiles. It would be theoretically challenging and
practically relevant to find conditions for the consistency of
the delete-cluster jackknife variance estimator of a non-smooth
estimator $\hat{\theta}$.
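
The construction of the delete-cluster jackknife design weights described above can be sketched in Python as follows. This is a simplified illustration: the function names are hypothetical, every stratum is assumed to contain at least two sampled PSUs, and the non-response and post-stratification adjustments are omitted, so the design weights are used directly as final weights for a weighted total.

```python
from collections import defaultdict

def jackknife_weights(strata, psus, d):
    """Yield (c_r, replicate design weights) for each deleted cluster r = (h, j)."""
    clusters = defaultdict(set)
    for h, j in zip(strata, psus):
        clusters[h].add(j)
    for h in sorted(clusters):
        n_h = len(clusters[h])
        for j in sorted(clusters[h]):
            w = []
            for h_i, j_i, d_i in zip(strata, psus, d):
                if h_i != h:
                    w.append(d_i)                     # other stratum: unchanged
                elif j_i == j:
                    w.append(0.0)                     # unit in the deleted cluster
                else:
                    w.append(n_h * d_i / (n_h - 1))   # same stratum, retained
            yield (n_h - 1) / n_h, w

def jackknife_variance_of_total(strata, psus, d, y):
    """Apply (10) with c_r = (n_h - 1)/n_h to the weighted total."""
    total = sum(d_i * y_i for d_i, y_i in zip(d, y))
    return sum(
        c_r * (sum(w_i * y_i for w_i, y_i in zip(w, y)) - total) ** 2
        for c_r, w in jackknife_weights(strata, psus, d)
    )
```

With a single stratum and two PSUs, the result equals the squared difference of the two weighted PSU totals, the usual with-replacement variance estimator for that design.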

BRR can handle non-smooth $\hat{\theta}$, but it is readily applicable
only for the important special case of $n_h = 2$ PSUs
per stratum. A minimal set of balanced half-samples can be
constructed from an $R \times R$ Hadamard matrix by selecting
$H$ columns, excluding the column of +1’s, where
$H + 1 \le R \le H + 4$ (McCarthy 1969). The BRR design weights
$d_i^{(r)}$ equal $2d_i$ or 0 according as whether or not $i$ is in
the half-sample. A modified BRR, due to Bob Fay, uses all
the sampled units in each replicate, unlike the BRR, by defining
the replicate design weights as $d_i^{(r)}(\epsilon) = (1 + \epsilon) d_i$
or $(1 - \epsilon) d_i$ according as whether or not $i$ is in the
half-sample, where $0 < \epsilon < 1$; a good choice of $\epsilon$ is 1/2. The
modified BRR weights are then adjusted for non-response
and post-stratification to get the final weights $w_i^{(r)}(\epsilon)$ and
the estimator $\hat{\theta}^{(r)}(\epsilon)$. The modified BRR variance
estimator is given by (10) divided by $\epsilon^2$ and with $\hat{\theta}^{(r)}$ replaced
by $\hat{\theta}^{(r)}(\epsilon)$; see Rao and Shao (1999). The modified BRR
is particularly useful under independent re-imputation for
missing item responses in each replicate because it can use
the donors in the full sample to impute unlike the BRR that
uses the donors only in the half-sample.
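
A minimal Python sketch of the modified (Fay) BRR for a weighted total follows. Several choices here are assumptions made for illustration and are not taken from the text: the Hadamard matrix is built by the Sylvester construction, so R is the smallest power of 2 exceeding H rather than the minimal multiple of 4; the coefficients in (10) are taken as c_r = 1/R, the usual BRR choice; and the non-response and post-stratification adjustments are omitted.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of 2); its first column is all +1."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def fay_brr_variance_of_total(strata, half, d, y, eps=0.5):
    """strata: stratum index 0..H-1 for each unit; half: 0 or 1, identifying
    which of the two PSUs in its stratum the unit belongs to."""
    n_strata = max(strata) + 1
    R = 1
    while R <= n_strata:        # smallest power of 2 exceeding H
        R *= 2
    had = sylvester_hadamard(R)
    total = sum(d_i * y_i for d_i, y_i in zip(d, y))
    var = 0.0
    for row in had:
        rep = 0.0
        for h, a, d_i, y_i in zip(strata, half, d, y):
            sign = row[h + 1]                    # skip the all-+1 column
            in_half = (sign == 1) == (a == 0)    # PSU 0 is in when the sign is +1
            rep += (1 + eps if in_half else 1 - eps) * d_i * y_i
        var += (rep - total) ** 2
    return var / (R * eps ** 2)
```

Because the selected columns are orthogonal, the cross-stratum terms cancel and, for a total, the result equals the sum over strata of the squared differences between the two weighted PSU totals, for any choice of eps.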

The Rao-Wu bootstrap is valid for arbitrary $n_h$ ($\ge 2$),
unlike the BRR, and it can also handle non-smooth $\hat{\theta}$. Each
bootstrap replicate is constructed by drawing a simple
random sample of PSUs of size $n_h - 1$, with replacement, from
the $n_h$ sample clusters, independently across the strata. The bootstrap
design weights $d_i^{(r)}$ are given by $[n_h/(n_h - 1)] m_{hi}^{(r)} d_i$ if
$i$ is in stratum $h$ and replicate $r$, where $m_{hi}^{(r)}$ is the number
of times sampled PSU $(hi)$ is selected, $\sum_i m_{hi}^{(r)} = n_h - 1$.
The weights $d_i^{(r)}$ are then adjusted for unit non-response
and post-stratification to get the final bootstrap weights and
the estimator $\hat{\theta}^{(r)}$. Typically, $R = 500$ bootstrap replicates
are used in the bootstrap variance estimator (10). Several 
recent surveys at Statistics Canada have adopted the boot- 
strap method for variance estimation because of the flex- 
ibility in the choice of R and wider applicability. Users of 
Statistics Canada survey micro data files seem to be very 
happy with the bootstrap method for analysis of data. 
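
The bootstrap weight construction can be sketched in Python as below. The sketch rests on assumptions not stated in the text: the function names are hypothetical, the PSUs are drawn with replacement using the standard library generator, the coefficients in (10) are taken as c_r = 1/R, and the non-response and post-stratification adjustments are omitted.

```python
import random
from collections import Counter, defaultdict

def rao_wu_weights(strata, psus, d, rng):
    """One bootstrap replicate of design weights: [n_h/(n_h - 1)] * m_hi * d_i."""
    clusters = defaultdict(set)
    for h, j in zip(strata, psus):
        clusters[h].add(j)
    counts = {}
    for h, labels in clusters.items():
        n_h = len(labels)
        # n_h - 1 draws with replacement from the n_h sampled PSUs
        counts[h] = Counter(rng.choices(sorted(labels), k=n_h - 1))
    return [
        len(clusters[h]) / (len(clusters[h]) - 1) * counts[h][j] * d_i
        for h, j, d_i in zip(strata, psus, d)
    ]

def bootstrap_variance_of_total(strata, psus, d, y, R=500, seed=0):
    """Apply (10) with c_r = 1/R to the weighted total."""
    rng = random.Random(seed)
    total = sum(d_i * y_i for d_i, y_i in zip(d, y))
    reps = []
    for _ in range(R):
        w = rao_wu_weights(strata, psus, d, rng)
        reps.append(sum(w_i * y_i for w_i, y_i in zip(w, y)))
    return sum((t - total) ** 2 for t in reps) / R
```

The rescaling factor n_h/(n_h - 1) compensates for drawing only n_h - 1 PSUs, so that when all PSUs in a stratum carry the same total weight, the replicate weights in that stratum sum to the same value as the original design weights.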

Early work on the jackknife and the BRR was largely 
empirical (e.g., Kish and Frankel 1974). Krewski and Rao 
(1981) formulated a formal asymptotic framework appropri- 
ate for stratified multi-stage sampling and established design 
consistency of the jackknife and BRR variance estimators 
when $\hat{\theta}$ can be expressed as a smooth function of estimated
means. Several extensions of this basic work have been 
reported in the recent literature; see the book by Shao and 
Tu (1995, Chapter 6). Theoretical support for re-sampling 
methods is essential for their use in practice. 

In the above discussion, I let $\hat{\theta}$ denote the estimator of a
“census” parameter. The census parameter $\theta_C$ is often
motivated by an underlying super-population model and the
census is regarded as a sample generated by the model,
leading to census estimating equations whose solution is
$\theta_C$. The census estimating functions $U_C(\theta)$ are simply
population totals of functions $u_i(\theta)$ with zero expectation
under the assumed model, and the census estimating equations
are given by $U_C(\theta) = 0$ (Godambe and Thompson
1986). Kish and Frankel (1974) argued that the census
parameter makes sense even if the model is not correctly
specified. For example, in the case of linear regression, the
census regression coefficient could explain how much of the
relationship between the response variable and the independent
variables is accounted for by a linear regression model.
Noting that the census estimating functions are simply population
totals, survey weighted estimators $\hat{U}(\theta)$ from the
full sample and $\hat{U}^{(r)}(\theta)$ from each pseudo-replicate are
obtained. The solutions of corresponding estimating equations
$\hat{U}(\theta) = 0$ and $\hat{U}^{(r)}(\theta) = 0$ give $\hat{\theta}$ and $\hat{\theta}^{(r)}$
respectively. Note that the re-sampling variance estimators are
designed to estimate the variance of $\hat{\theta}$ as an estimator of the
census parameters but not the model parameters. Under
certain conditions, the difference can be ignored but in 
general we have a two-phase sampling situation, where the 
census is the first phase sample from the super-population 
and the sample is a probability sample from the census 
population. Recently, some useful work has been done on 
two-phase variance estimation when the model parameters 
are the target parameters (Graubard and Korn 2002; Rubin- 
Bleuer and Schiopu-Kratina 2005), but more work is needed 
to address the difficulty in specifying the covariance 
structure of the model errors. 

A difficulty with the bootstrap is that the solution $\hat{\theta}^{(r)}$
may not exist for some bootstrap replicates $r$ (Binder,
Kovacevic and Roberts 2004). Rao and Tausi (2004) used
an estimating function (EF) bootstrap method that avoids
the difficulty. In this method, we solve $\hat{U}(\theta) = \hat{U}^{(r)}(\hat{\theta})$
for $\theta$ using only one step of the Newton-Raphson iteration
with $\hat{\theta}$ as the starting value. The resulting estimator $\tilde{\theta}^{(r)}$ is
then used in (10) to get the EF bootstrap variance estimator
of $\hat{\theta}$, which can be readily implemented from the data file
providing replicate weights, using slight modifications of
any software package that accounts for survey weights. It is
interesting to note that the EF bootstrap variance estimator is
equivalent to a Taylor linearization sandwich variance estimator
that uses the bootstrap variance estimator of $\hat{U}(\theta)$
and the inverse of the observed information matrix (derivative
of $-\hat{U}(\theta)$), both evaluated at $\theta = \hat{\theta}$ (Binder et al.
2004).
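
The one-step calculation can be illustrated for the simplest case, the survey-weighted mean, whose estimating function is the weighted sum of (y_i - theta). The Python sketch below takes the equation in the preceding paragraph at face value; the function name is hypothetical and the scalar setting is an assumption chosen to keep the example short.

```python
def ef_one_step_mean(y, w, w_rep):
    """One Newton-Raphson step, started at theta_hat, for the equation
    U(theta) = U_rep(theta_hat), where U(theta) = sum_i w_i * (y_i - theta)."""
    theta_hat = sum(w_i * y_i for w_i, y_i in zip(w, y)) / sum(w)
    # Replicate estimating function evaluated at the full-sample estimate
    u_rep = sum(wr * (y_i - theta_hat) for wr, y_i in zip(w_rep, y))
    info = sum(w)   # minus the derivative of U with respect to theta
    # g(theta) = U(theta) - u_rep has g(theta_hat) = -u_rep and slope -info,
    # so the Newton step is theta_hat - g/g' = theta_hat - u_rep / info.
    theta_tilde = theta_hat - u_rep / info
    return theta_hat, theta_tilde
```

The step requires only that the full-sample information be invertible, so a replicate value exists for every replicate, whatever the replicate weights. The deviation of the replicate value from the full-sample estimate is the information inverse times the replicate estimating function, which is the source of the sandwich form noted above.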

Taylor linearization methods provide asymptotically val- 
id variance estimators for general sampling designs, unlike 
re-sampling methods, but they require a separate formula for 
each estimator $\hat{\theta}$. Binder (1983), Rao, Yung and Hidiroglou
(2002) and Demnati and Rao (2004) have provided unified 
linearization variance formulae for estimators defined as 
solutions to estimating equations.

Pfeffermann (1993) discussed the role of design weights
in the analysis of survey data. If the population model holds 
for the sample (i.e., if there is no sample selection bias), then 
model-based unweighted estimators will be more efficient 
than the weighted estimators and lead to valid inferences, 
especially for data with smaller sample sizes and larger 
variation in the weights. However, for typical data from 
large-scale surveys, the survey design is informative and the 
population model may not hold for the sample. As a result,
the model-based estimators can be seriously biased and 
inferences can be erroneous. Pfeffermann and his colleagues 
initiated a new approach to inference under informative 
sampling; see Pfeffermann and Sverchkov (2003) for recent 
developments. This approach seems to provide more effi- 
cient inferences compared to the survey weighted approach, 
and it certainly deserves the attention of users of survey 
data. However, much work remains to be done, especially in 
handling data based on multi-stage sampling. 

Excellent accounts of methods for analysis of complex 
survey data are given in Skinner, Holt and Smith (1989), 
Chambers and Skinner (2003) and Lehtonen and Pahkinen 
(2004). 


7. Small Area Estimation


Previous sections of this paper have focussed on tradi- 
tional methods that use direct domain estimators based on 
domain-specific sample observations along with auxiliary 
population information. Such methods, however, may not 
provide reliable inferences when the domain sample sizes 
are very small or even zero for some domains. Domains or 
sub-populations with small or zero sample sizes are called 
small areas in the literature. Demand for reliable small area 
statistics has greatly increased in recent years because of the 
growing use of small area statistics in formulating policies 
and programs, allocation of funds and regional planning. 
Clearly, it is seldom possible to have a large enough overall 
sample size to support reliable direct estimates for all 
domains of interest. Also, in practice, it is not possible to 
anticipate all uses of survey data and “the client will always
require more than is specified at the design stage” (Fuller 
1999, page 344). In making estimates for small areas with 
adequate level of precision, it is often necessary to use 
“indirect” estimators that borrow information from related 
domains through auxiliary information, such as census and 
current administrative data, to increase the “effective” 
sample size within the small areas. 

It is now generally recognized that explicit models 
linking the small areas through auxiliary information and 
accounting for residual between-area variation through
random small area effects are needed in developing indirect 
estimators. Success of such model-based methods heavily
depends on the availability of good auxiliary information
and thorough validation of models through internal and 
external evaluations. Many of the random effects methods 
used in mainstream statistical theory are relevant to small 
area estimation, including empirical best (or Bayes), empir- 
ical best linear unbiased prediction and hierarchical Bayes 
based on prior distributions on the model parameters. A 
comprehensive account of such methods is given in Rao 
(2003). Practical relevance and theoretical interest of small 
area estimation have attracted the attention of many re- 
searchers, leading to important advances in point and mean 
squared error estimation. The “new” methods have been 
applied successfully worldwide to a variety of small area 
problems. Model-based methods have been recently used to 
produce county and school district estimates of poor school- 
age children in the U.S.A. The U.S. Department of Edu- 
cation allocates annually over $7 billion of funds to counties 
on the basis of model-based county estimates. The allocated 
funds support compensatory education programs to meet the 
needs of educationally disadvantaged children. We refer to 
Rao (2003, example 7.1.2) for details of this application. In 
the United Kingdom, the Office for National Statistics 
established a Small Area Estimation Project to develop 
model-based estimates at the level of political wards 
(roughly 2,000 households). The practice and estimation 
methods of U.S. federal statistical programs that use indirect 
estimators to produce published estimates are documented 
in Schaible (1996). Singh, Gambino and Mantel (1994) and 
Brackstone (2002) discuss some practical issues and strat- 
egies for small area statistics. 

Small area estimation is a striking example of the inter- 
play between theory and practice. The theoretical advances 
are impressive, but many practical issues need further 
attention of theory. Such issues include: (a) Benchmarking 
model-based estimators to agree with reliable direct esti- 
mators at large area levels. (b) Developing and validating 
suitable linking models and addressing issues such as errors 
in variables, incorrect specification of the linking model and 
omitted variables. (c) Development of methods that satisfy 
multiple goals: good area-specific estimates, good rank 
properties and good histogram for small areas. 


8. Some Theory Deserving Attention of 
Practice and Vice Versa 


In this section, I will briefly mention some examples of 
important theory that exists but is not widely used in practice.


8.1 Empirical Likelihood Inference 


Traditional sampling theory largely focused on point 
estimation and associated standard errors, appealing to nor- 
mal approximations for confidence intervals on parameters 


Survey Methodology, December 2005 


of interest. In mainstream statistics, the empirical likelihood 
(EL) approach (Owen 1988) has attracted a lot of attention 
due to several desirable properties. It provides a non- 
parametric likelihood, leading to EL ratio confidence inter- 
vals similar to the parametric likelihood ratio intervals. The 
shape and orientation of EL intervals are determined en- 
tirely by the data, and the intervals are range preserving and 
transformation respecting, and are particularly useful in 
providing balanced tail error rates, unlike the symmetric 
normal theory intervals. As noted in section 3.1, the EL 
approach was in fact first introduced in the sample survey 
context by Hartley and Rao (1968), but their focus was on 
inferential issues related to point estimation. Chen, Chen 
and Rao (2003) obtained EL intervals on the population 
mean under simple random and stratified random sampling 
for populations containing many zeros. Such populations are 
encountered in audit sampling, where y denotes the 
amount of money owed to the government and the mean $\bar{Y}$
is the average amount of excessive claims. Previous work 
on audit sampling used parametric likelihood ratio intervals 
based on parametric mixture distributions for the variable 
y. Such intervals perform better than the standard normal 
theory intervals, but EL intervals perform better under 
deviations from the assumed mixture model, by providing 
non-coverage rate below the lower bound closer to the 
nominal error rate and also larger lower bound. For general 
designs, Wu and Rao (2004) used a pseudo-empirical 
likelihood (Chen and Sitter 1999) to obtain adjusted pseudo- 
EL intervals on the mean and the distribution function that 
account for the design features, and showed that the 
intervals provide more balanced tail error rates than the 
normal theory intervals. The EL method also provides a 
systematic approach to calibration estimation and integra- 
tion of surveys. We refer the reader to the review papers by 
Rao (2004) and Wu and Rao (2005). 

Further refinements and extensions remain to be done, 
particularly on the pseudo-empirical likelihood, but the EL 
theory in the survey context deserves the attention of 
practice. 


8.2 Exploratory Analyses of Survey Data 


In section 6 we discussed methods for confirmatory 
analysis of survey data taking the design into account, such 
as point estimation of model (or census) parameters and 
associated standard errors and formal tests of hypotheses. 
Graphical displays and exploratory data analyses of survey 
data are also very useful. Such methods have been extensively
developed in the mainstream literature. Only recently
have some extensions of these modern methods been reported in
the survey literature, and they deserve the attention of practice. I
will briefly mention some of those developments. First,
non-parametric kernel density estimates are commonly used to
display the shape of a data set without relying on parametric
models. They can also be used to compare different sub- 
populations. 

Bellhouse and Stafford (1999) provided kernel density 
estimators that take account of the survey design and studied 
their properties and applied the methods to data from the 
Ontario Health Survey. Buskirk and Lohr (2005) studied 
asymptotic and finite sample properties of kernel density 
estimators and obtained confidence bands. They applied the 
methods to data from the US National Crime Victimization 
Survey and the US National Health and Nutrition Examination
Survey.

Secondly, Bellhouse and Stafford (2001) developed local 
polynomial regression methods, taking design into account, 
that can be used to study the relationship between a re- 
sponse variable and predictor variables, without making 
strong parametric model assumptions. The resulting graph- 
ical displays are useful in understanding the relationships 
and also for comparing different sub-populations. Bellhouse 
and Stafford (2001) illustrated local polynomial regression 
on the Ontario Health Survey data; for example, the 
relationship between body mass index of females and age. 
Bellhouse, Chipman and Stafford (2004) studied additive 
models for survey data via penalized least squares method to 
handle more than one predictor variable, and illustrated the 
methods on the Ontario Health Survey data. This approach 
has many advantages in terms of graphical display, 
estimation, testing and selection of “smoothing” parameters 
for fitting the models. 


8.3 Measurement Errors


Typically, measurement errors are assumed to be addi- 
tive with zero means. As a result, usual estimators of totals 
and means remain unbiased or consistent. However, this 
nice feature may not hold for more complex parameters 
such as distribution functions, quantiles and regression 
coefficients. In the latter case, the usual estimators will be 
biased, even for large samples, and hence can lead to 
erroneous inferences (Fuller 1995). It is possible to obtain 
bias-adjusted estimators if estimates of measurement error 
variances are available. The latter may be obtained by 
allocating resources at the design stage to make repeated 
observations on a sub-sample. Fuller (1975, 1995) has been 
a champion of proper methods in the presence of 
measurement errors and the bias-adjusted methods deserve 
the attention of practice. 

Hartley and Rao (1978) and Hartley and Biemer (1978) 
provided interviewer and coder assignment conditions that 
permit the estimation of sampling and response variances 
for the mean or total from current surveys. Unfortunately, 
current surveys are often not designed to satisfy those 
conditions and, even if they do, the required information on
interviewer and coder assignments is seldom available at the
estimation stage. 

Linear components of variance models are often used to 
estimate interviewer variability. Such models are appropri- 
ate for continuous responses, but not for binary responses. 
The linear model approach for binary responses can result in 
underestimating the intra-interviewer correlations. Scott and 
Davis (2001) proposed multi-level models for binary re- 
sponses to estimate interviewer variability. Given that re- 
sponses are often binary in many surveys, practice should 
pay attention to such models for proper analyses of survey 
data with binary responses. 


8.4 Imputation for Missing Survey Data 


Imputation is commonly used in practice to fill in 
missing item values. It ensures that the results obtained from 
different analyses of the completed data set are consistent 
with one another by using the same survey weight for all 
items. Marginal imputation methods, such as ratio, nearest 
neighbor and random donor within imputation classes are 
used by many statistical agencies. Unfortunately, the im- 
puted values are often treated as if they were true values and 
then used to compute estimates and variance estimates. The 
imputed point estimates of marginal parameters are generally 
valid under an assumed response mechanism or imputation 
model. But the “naive” variance estimators can lead 
to erroneous inferences even for large samples; in particular, 
they can seriously underestimate the variance of the imputed 
estimator because the additional variability due to estimating 
the missing values is not taken into account. Advocates of 
Rubin’s (1987) multiple imputation claim that the multiple 
imputation variance estimator can fix this problem because 
a between imputed estimators sum of squares is added to the 
average of naive variance estimators resulting from the 
multiple imputations. Unfortunately, there are some diffi- 
culties associated with multiple imputation variance esti- 
mators, as discussed by Kott (1995), Fay (1996), Binder and 
Sun (1996), Wang and Robins (1998), Kim, Brick, Fuller 
and Kalton (2004) and others. Moreover, single imputation 
is often preferred due to operational and cost considerations. 
Some impressive advances have been made in recent years 
on making efficient and asymptotically valid inferences 
from singly imputed data sets. We refer the reader to review 
papers by Shao (2002) and Rao (2000, 2005) for methods of 
variance estimation under single imputation. Kim and Fuller 
(2004) studied fractional imputation using more than one 
randomly imputed value and showed that it also leads to 
asymptotically valid inferences; see also Kalton and Kish 
(1984) and Fay (1996). An advantage of fractional impu- 
tation is that it reduces the imputation variance relative to 
single imputation using one randomly imputed value. The 
above methods of variance estimation deserve the attention 
of practice. 
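
As an illustration of the multiple imputation variance estimator described above, the following is a minimal sketch of Rubin's combining rule: the average of the naive variance estimators plus the between-imputation component inflated by (1 + 1/M). The function name `rubin_combine` and the numerical inputs are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean, variance


def rubin_combine(estimates, naive_variances):
    """Combine M multiply-imputed point estimates with Rubin's (1987) rule.

    estimates       : one imputed point estimate per completed data set
    naive_variances : the naive variance estimate from each completed data set
    """
    m = len(estimates)
    if m < 2 or m != len(naive_variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    point = mean(estimates)
    within = mean(naive_variances)      # average of the naive variance estimators
    between = variance(estimates)       # between-imputation sum of squares / (M - 1)
    total = within + (1.0 + 1.0 / m) * between
    return point, within, between, total


# Hypothetical inputs for M = 5 imputations (illustrative numbers only).
point, within, between, total = rubin_combine(
    [10.2, 9.8, 10.5, 10.1, 9.9],
    [0.40, 0.38, 0.42, 0.41, 0.39],
)
print(point, within, between, total)
```

The sketch only shows how the between-imputation term enters; it does not address the difficulties with the multiple imputation variance estimator discussed in the references cited above.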


8.5 Multiple Frame Surveys 


Multiple frame surveys employ two or more overlapping 
frames that can cover the target population. Hartley (1962) 
studied the special case of a complete frame B and an 
incomplete frame A and simple random sampling indepen- 
dently from both frames. He showed that an “optimal” dual 
frame estimator can lead to large gains in efficiency for the 
same cost over the single complete frame estimator, pro- 
vided the cost per unit for frame A is significantly smaller 
than the cost per unit for frame B. Multiple frame surveys 
are particularly suited for sampling rare or hard-to-reach 
populations, such as homeless populations and persons with 
AIDS, when incomplete list frames contain high proportions 
of individuals from the target population. Hartley’s (1974) 
landmark paper derived “optimal” dual frame estimators for 
general sampling designs and possibly different obser- 
vational units in the two frames. Fuller and Burmeister 
(1972) proposed improved “optimal” estimators. However, 
the optimal estimators use different sets of weights for each 
item y, which is not desirable in practice. Skinner and Rao 
(1996) derived pseudo-ML (PML) estimators for dual frame 
surveys that use the same set of weights for all items y, 
similar to “single frame” estimators (Kalton and Anderson 
1986), and maintain efficiency. Lohr and Rao (2005) 
developed a unified theory for the multiple frames setting 
with two or more frames, by extending the optimal, pseudo- 
ML and single frame estimators. Lohr and Rao (2000, 2005) 
obtained asymptotically valid jackknife variance estimators. 
Those general results deserve the attention of practice when 
dealing with two or more frames. Dual frame telephone 
surveys based on cell phone and landline phone frames need 
the attention of theory because it is unclear how to weight in 
the cell phone survey: some families share a cell phone and 
others have a cell phone for each person. 


8.6 Indirect Sampling 


The method of indirect sampling can be used when the 
frame for a target population $U^B$ is not available but the 
frame for another population $U^A$, linked to $U^B$, is 
employed to draw a probability sample. The links between 
the two populations are used to develop suitable weights 
that can provide unbiased estimators and variance esti- 
mators. Lavallée (2002) developed a unified method, called 
Generalized Weight Sharing (GWS), that covers several 
known methods: the weight sharing method of Ernst (1989) 
for cross sectional estimation from longitudinal household 
surveys, network sampling and multiplicity estimation 
(Sirken 1970) and adaptive cluster sampling (Thompson 
and Seber 1996). Rao’s (1968) theory for sampling from a 
frame containing an unknown amount of duplication may be 
regarded as a special case of GWS. Multiple frames can also 
be handled by GWS and the resulting estimators are simple 
but not necessarily efficient compared to the optimal esti- 
mators of Hartley (1974) or the PML estimators. The GWS 
method has wide applicability and deserves the attention of 
practice. 


9. Concluding Remarks 


Joe Waksberg’s contributions to sample survey theory 
and methods truly reflect the interplay between theory and 
practice. Working at the US Census Bureau and later at 
Westat, he faced real practical problems and produced 
sound theoretical solutions. For example, his landmark pa- 
per (Waksberg 1978) studied an ingenious method (pro- 
posed by Warren Mitofsky) for random digit dialing (RDD) 
that significantly reduces the survey costs compared to 
dialing numbers completely at random. He presented sound 
theory to demonstrate its efficiency. The widespread use of 
RDD surveys is largely due to the theoretical development 
in Waksberg (1978) and subsequent refinements. Joe 
Waksberg is one of my heroes in survey sampling and I feel 
greatly honored to have received the 2005 Waksberg award 
for survey methodology. 


Hot Deck Imputation for the Response Model 


Wayne A. Fuller and Jae Kwang Kim 


Abstract 


Hot deck imputation is a procedure in which missing items are replaced with values from respondents. A model supporting 
such procedures is the model in which response probabilities are assumed equal within imputation cells. An efficient version 
of hot deck imputation is described for the cell response model and a computationally efficient variance estimator is given. 
An approximation to the fully efficient procedure in which a small number of values are imputed for each nonrespondent is 
described. Variance estimation procedures are illustrated in a Monte Carlo study. 


Key Words: Nonresponse; Fractional imputation; Response probability; Replication variance estimation. 


1. Introduction 


Imputation is used in sample surveys as a method of 
handling item nonresponse. In hot deck imputation, the 
imputed values are functions of the respondents in the 
current sample. Sande (1983) and Ford (1983) contain 
descriptions of hot deck imputation. Kalton and Kasprzyk 
(1986) and Little and Rubin (2002) review various impu- 
tation procedures. 

In one version of hot deck imputation, the imputed value 
is the value of a respondent in the same imputation cell, 
where the imputation cells form an exhaustive and mutually 
exclusive subdivision of the population. In random hot deck 
imputation, nonrespondents are assigned values at random from 
respondents in the same imputation cell. The record 
providing the value is called the donor and the record with 
the missing value is called the recipient. 
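
The donor and recipient roles just described can be sketched in a few lines. The code below is a minimal illustration of random hot deck imputation within cells, assuming equal selection probabilities among the donors of a cell; the function name `random_hot_deck`, the toy data and the seed are hypothetical.

```python
import random


def random_hot_deck(values, cells, rng):
    """Fill missing entries (None) with the value of a randomly chosen
    respondent (donor) from the same imputation cell."""
    donors_by_cell = {}
    for i, (value, c) in enumerate(zip(values, cells)):
        if value is not None:
            donors_by_cell.setdefault(c, []).append(i)

    imputed = list(values)
    donor_of = {}                      # recipient index -> donor index
    for j, (value, c) in enumerate(zip(values, cells)):
        if value is None:
            if c not in donors_by_cell:
                raise ValueError(f"cell {c} has no respondents; collapse cells")
            i = rng.choice(donors_by_cell[c])
            imputed[j] = values[i]
            donor_of[j] = i
    return imputed, donor_of


# Toy data: None marks item nonresponse; cells are labelled 1 and 2.
values = [7.0, None, None, 14.0, 3.0, 15.0, 8.0, 9.0, 2.0, None]
cells = [1, 1, 2, 1, 2, 1, 2, 1, 2, 1]
imputed, donor_of = random_hot_deck(values, cells, random.Random(2005))
print(imputed)
print(donor_of)
```

Because each recipient receives a single randomly selected value, the imputed estimator carries the additional imputation variance discussed in the next paragraph.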

The variance of the imputed estimator is generally larger 
than the complete sample variance because nonresponse 
reduces sample size and because the imputed estimator may 
contain a component due to random imputation. Rao and 
Shao (1992) proposed an adjusted jackknife method for hot- 
deck imputation where the first phase units are selected 
with-replacement. Rao and Sitter (1995) discussed the 
adjusted jackknife variance estimation method for ratio 
imputation. Rao (1996) and Sitter (1997) applied the 
adjusted jackknife method to regression imputation. Shao, 
Chen and Chen (1998) apply the idea of Rao and Shao 
(1992) to the balanced repeated replication method. Shao 
and Steel (1999) propose variance estimation for survey 
data with composite imputation, where more than one 
imputation method is used, and the sampling fractions are 
included in the variance expressions. Yung and Rao (2000) 
applied the adjusted jackknife method to imputed estimators 
constructed with a poststratified sample. Rubin (1987) and 
Rubin and Schenker (1986) suggested multiple imputation 
procedures. Tollefson and Fuller (1992) and Särndal (1992) 
proposed imputation methods and corresponding variance 
estimators. Kim and Fuller (2004) studied the use of 
fractional imputation for the model in which observations in 
an imputation cell are independently and identically 
distributed. 

In this paper, we consider hot deck imputation for a 
population divided into imputation cells. The response 
model is described in section 2. In section 3, we introduce 
fully efficient fractional imputation and present a variance 
estimation method for the imputation estimator, under the 
assumption that the probability of nonresponse is constant 
within a cell. In section 4 we suggest a modification of the 
fully efficient method that uses a smaller number of donors. 
In section 5, an example is introduced to illustrate the actual 
implementation of the proposed method. In section 6, results 
of a simulation study are reported. A summary is presented in 
the last section. 


2. Basic Setup 


Consider a population of N elements identified by a set of 
indices $U = \{1, 2, \ldots, N\}$. Associated with each unit $i$ in the 
population there is a study variable $y_i$ and a vector $\mathbf{x}_i$ of 
auxiliary information. The set of vectors $(y_i, \mathbf{x}_i)$, 
$i = 1, 2, \ldots, N$, is denoted by $F$. 

Let $A$ denote the indices of the elements in a sample 
selected by a set of probability rules called the sampling 
mechanism. Let the population quantity of interest be $\theta_N$, 
let $\hat{\theta}$ be a full sample, linear-in-$y$, estimator of $\theta_N$, and 
write 

$$\hat{\theta} = \sum_{i \in A} w_i y_i. \qquad (1)$$

If $w_i$ is the inverse of the selection probability, then $\hat{\theta}$ is 
unbiased for the population total. 

Let $A_R$ and $A_M$ denote the sets of indices of the sample 
respondents and sample nonrespondents, respectively. 
Define the response indicator function 

$$R_i = \begin{cases} 1 & \text{if } i \in A_R \\ 0 & \text{if } i \in A_M \end{cases} \qquad (2)$$

and let $\mathbf{R} = \{(i, R_i);\ i \in A\}$. The distribution of $\mathbf{R}$ is called 
the response mechanism. 

Assume that the finite population U is made up of G 
imputation cells, where the set of elements in cell g is $U_g$. 
Let $n_g$ be the number of sample elements in imputation cell 
g and let $r_g$, $r_g > 0$, be the number of respondents in 
imputation cell g. Assume the within-cell uniform response 
model in which the $r_g$ responses in a cell are equivalent to a 
Poisson sample selected with equal probabilities from the 
$n_g$ elements. 

Fractional imputation is a procedure in which more than 
one donor is used per recipient. Kalton and Kish (1984) 
suggested fractional imputation as an efficient imputation 
procedure. The method was discussed by Fay (1996). Let 
$d_{ij}$ be the number of times that $y_i$ is used as donor for the 
missing $y_j$ and define $\mathbf{d} = \{d_{ij};\ i \in A_R,\ j \in A_M\}$. The 
distribution of $\mathbf{d}$ is called the imputation mechanism. Let 
$w_{ij}^*$ be the factor applied to the original weight for element $j$ 
when $y_i$ is used as a donor for element $j$. For element 
$j$, $j \in A_M$, 

$$y_{Ij} = \sum_{i \in A_R} d_{ij} w_{ij}^* y_i \qquad (3)$$

is the weighted mean of the respondent values. The factor 
$w_{ij}^*$ is called the imputation fraction. It is the fraction that 
donor $i$ donates for the missing item $y_j$. Note that $w_{ii}^* = 1$ 
for $i \in A_R$ and $w_{ij}^* = 0$ for $i \ne j$, $i, j \in A_R$. The sum of the 
imputation fractions for a missing item is restricted to equal 
one, 

$$\sum_{i \in A_R} d_{ij} w_{ij}^* = 1 \quad \forall\, j \in A_M. \qquad (4)$$

An estimator with the imputed values defined in (3) and 
some $w_{ij}^* < 1$ is called a fractionally imputed estimator. 
A linear-in-y imputation estimator can be written in the 
form 

$$\hat{\theta}_I = \sum_{i \in A_R} \Big( \sum_{j \in A} w_j d_{ij} w_{ij}^* \Big) y_i \qquad (5)$$

$$=: \sum_{i \in A_R} \alpha_i y_i, \qquad (6)$$

where the notation A =: B means that B is defined to be 
equal to A. The sum of $w_{ij}^* w_j$ over all recipients for which $i$ 
is a donor (including acting as a donor for itself), denoted by 
$\alpha_i$, is the total weight of donor $i$. If a responding unit $i$ is 
not used as a donor, except for itself, then $\alpha_i = w_i$. 


3. Fully Efficient Fractional Imputation 


Assume all elements in an imputation cell have the same 
probability of responding and assume the responses are 
independent. Then the overall distribution of an imputed 
estimator under the response model can be obtained by 
using the probability structure of multiple phase sampling, 
where the response model is treated as the second phase 
sampling mechanism. 

If the response probabilities in a cell are uniform, then a 
reasonable estimator of the total is the weighted sum of ratio 
estimators 

$$\hat{\theta}_{FE} = \sum_{g=1}^{G} \Big( \sum_{i \in A \cap U_g} w_i \Big) \frac{\sum_{i \in A_R \cap U_g} w_i y_i}{\sum_{i \in A_R \cap U_g} w_i}. \qquad (7)$$

In the context of two phase sampling, Kott and Stukel 
(1997) call the estimator (7) a reweighted expansion 
estimator. The estimator (7) is called fully efficient because it 
contains no variability due to random selection of donors. If 
the $w_i$ are the same for all elements in a cell, the ratio 

$$\Big( \sum_{i \in A_R \cap U_g} w_i \Big)^{-1} \sum_{i \in A_R \cap U_g} w_i y_i \qquad (8)$$

is a simple mean and, hence, unbiased for the cell mean 
given that there is at least one respondent in the cell. If the 
$w_i$ in a cell are not equal, then (8) is subject to ratio bias. It 
is possible for the number of elements in a cell, $n_g$, to be 
positive and the number of respondents, $r_g$, to be zero. If 
this occurs in practice, cells will be collapsed. 

The large sample properties of the estimator can be 
obtained for a sequence of populations and samples. 
Assume the population is composed of $G_\nu$ mutually 
exclusive and exhaustive cells, where $\nu$ is the index of the 
sequence. Assume the variance of a full sample estimator of 
the mean is $O(n_\nu^{-1})$, where $n_\nu$ is the size of the sample 
selected from the $\nu$th population. Assume responses are 
independent. Then, under regularity conditions, the procedures 
used by Kim, Navarro and Fuller (2005) in the proof 
of their Theorem 2.1 can be used to show that estimator (7) 
satisfies 

$$\hat{\theta}_{FE,\nu} = \hat{\theta}_\nu + \sum_{g=1}^{G_\nu} \sum_{i \in A_{g\nu}} w_i (\pi_{g\nu}^{-1} R_{i\nu} - 1) e_{i\nu} + o_p(n_\nu^{-1/2} N_\nu), \qquad (9)$$

where $e_{i\nu} = y_{i\nu} - \bar{Y}_{g\nu}$, $A_{g\nu}$ is the set of sample indices in the 
$g$th cell for the $\nu$th sample, $\bar{Y}_{g\nu}$ is the population mean of 
the y-variable in cell $g\nu$ of population $F_\nu$, $\pi_{g\nu}$ is the 
probability that an element in cell $g\nu$ responds, and $F_\nu$ denotes 
the $\nu$th population. Also 

$$V(\tilde{\theta}_{FE,\nu} \mid F_\nu) = V(\hat{\theta}_\nu \mid F_\nu) + E\Big\{ \sum_{g=1}^{G_\nu} \pi_{g\nu}^{-1} (1 - \pi_{g\nu}) \sum_{i \in A_{g\nu}} w_i^2 e_{i\nu}^2 \,\Big|\, F_\nu \Big\}, \qquad (10)$$

where 

$$\tilde{\theta}_{FE,\nu} = \hat{\theta}_\nu + \sum_{g=1}^{G_\nu} \sum_{i \in A_{g\nu}} w_i (\pi_{g\nu}^{-1} R_{i\nu} - 1) e_{i\nu}.$$


The estimator (7) can be implemented by using fractional 
imputation in which every responding unit in an imputation 
cell is used as a donor for every nonrespondent in the cell. 
Then, the estimator (7) can be written as the fractionally 
imputed estimator 

$$\hat{\theta}_{FEFI} = \sum_{g=1}^{G} \sum_{j \in A \cap U_g} \sum_{i \in A_R \cap U_g} w_j w_{ij}^* y_i, \qquad (11)$$

where $w_j w_{ij}^*$ is the weight of donor $i$ for recipient $j$, $w_{ij}^*$ is 
the imputation fraction of donor $i$ for recipient $j$ defined in 
(3), and 

$$w_{ij}^* = \begin{cases} \Big( \sum_{s \in A_R \cap U_g} w_s \Big)^{-1} w_i & \text{if } R_j = 0 \\ 1 & \text{if } R_j = 1 \text{ and } i = j. \end{cases} \qquad (12)$$

The estimator (11) with $w_{ij}^*$ of (12), algebraically 
equivalent to (7), is called the fully efficient fractionally 
imputed (FEFI) estimator. The fractionally imputed estimator 
has the advantage that functions of y such as the 
fraction less than a given number can be directly estimated 
from the fractionally imputed data set. 
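
The algebraic equivalence of (7) and (11) can be checked numerically. The sketch below builds a fractionally imputed data set with the fractions of (12) and compares its weighted total with the reweighted ratio form (7). The function names, the record layout and the toy weights and values are hypothetical and serve only as an illustration.

```python
def fe_total(w, y, cell):
    """Estimator (7): for each cell, (sum of all sample weights) times the
    weighted mean of the respondents; None marks a nonrespondent."""
    total = 0.0
    for g in sorted(set(cell)):
        members = [i for i, c in enumerate(cell) if c == g]
        donors = [i for i in members if y[i] is not None]
        if not donors:
            raise ValueError(f"cell {g} has no respondents; collapse cells")
        ratio = sum(w[i] * y[i] for i in donors) / sum(w[i] for i in donors)
        total += sum(w[i] for i in members) * ratio
    return total


def fefi_records(w, y, cell):
    """Fractionally imputed data set as (recipient, donor, weight, value)
    records, with weight w_j * w*_ij and the fractions w*_ij of (12)."""
    records = []
    for j, y_j in enumerate(y):
        if y_j is not None:
            records.append((j, j, w[j], y_j))           # w*_jj = 1
            continue
        donors = [i for i, c in enumerate(cell)
                  if c == cell[j] and y[i] is not None]
        if not donors:
            raise ValueError(f"cell {cell[j]} has no respondents")
        denom = sum(w[i] for i in donors)
        for i in donors:
            records.append((j, i, w[j] * w[i] / denom, y[i]))
    return records


# Toy sample with unequal weights (illustrative values only).
w = [1.0, 2.0, 1.5, 1.0, 2.5, 1.0, 2.0, 1.0]
y = [7.0, None, 14.0, 15.0, 3.0, None, 8.0, 2.0]
cell = [1, 1, 1, 1, 2, 2, 2, 2]

records = fefi_records(w, y, cell)
total_from_records = sum(wt * val for _, _, wt, val in records)
total_from_ratio = fe_total(w, y, cell)

# Any full-sample statistic can be computed from the records, for example
# the estimated fraction of y-values below 10.
share_below_10 = (sum(wt for _, _, wt, val in records if val < 10)
                  / sum(wt for _, _, wt, _ in records))
print(total_from_records, total_from_ratio, share_below_10)
```

The record weights for each recipient sum to the original weight of that recipient, which is why the data set can be passed to any program written for complete samples.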

To consider replication variance estimation, let a replication 
variance estimator for the complete sample be 

$$\hat{V}(\hat{\theta}) = \sum_{k=1}^{L} c_k (\hat{\theta}^{(k)} - \hat{\theta})^2, \qquad (13)$$

where $\hat{\theta}^{(k)}$ is the $k$th estimate of $\theta_N$ based on the observations 
included in the $k$th replicate, $L$ is the number of replicates, 
and $c_k$ is a factor associated with replicate $k$ determined 
by the replication method. For a discussion of 
replication for survey samples see Krewski and Rao (1981) 
and Rao, Wu and Yue (1992). When the original estimator 
$\hat{\theta}$ is a linear estimator of the form (1), the $k$th replicate 
estimate of $\hat{\theta}$ can be written 

$$\hat{\theta}^{(k)} = \sum_{i \in A} w_i^{(k)} y_i, \qquad (14)$$

where $w_i^{(k)}$ denotes the replicate weight for the $i$th unit of 
the $k$th replication. 

A proposed replicate for the estimator $\hat{\theta}_{FEFI}$ is 

$$\hat{\theta}_{FEFI}^{(k)} = \sum_{g=1}^{G} \Big( \sum_{i \in A \cap U_g} w_i^{(k)} \Big) \frac{\sum_{i \in A_R \cap U_g} w_i^{(k)} y_i}{\sum_{i \in A_R \cap U_g} w_i^{(k)}} = \sum_{g=1}^{G} \sum_{j \in A \cap U_g} \sum_{i \in A_R \cap U_g} w_j^{(k)} w_{ij}^{*(k)} y_i. \qquad (15)$$

Using the replicates (15), the replicate variance estimator 
can be written as 

$$\hat{V}_{FE} = \sum_{k=1}^{L} c_k (\hat{\theta}_{FEFI}^{(k)} - \hat{\theta}_{FEFI})^2. \qquad (16)$$

The replicates in (15) can be computed in two steps. 
First, create the usual replicate by defining the weights $w_i^{(k)}$ 
for every element. Second, for a nonrespondent, the replicate 
imputation fraction for donor $i$ to recipient $j$ is 

$$w_{ij}^{*(k)} = \frac{w_i^{(k)}}{\sum_{s \in A_R \cap U_g} w_s^{(k)}}.$$

Note that the sum of the fractional replication weights of the 
donor records for each recipient is the same as the replication 
weight for that unit in a complete sample. 

The suggested procedure is closely related to the Rao and 
Shao (1992) variance estimator. See also Yung and Rao 
(2000). However, the use of fractional imputation greatly 
simplifies variance estimation. In the creation of replicates, 
only the weights on the imputed values are changed. No 
recomputing of imputed values is required, and once 
computed, the replicate weights can be used for any smooth 
function of the vector y. Also, the fractional replicates make 
the estimator (16) appropriate for a vector of y-variables. 
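
The two-step computation of the replicates can be sketched as follows. The example assumes a delete-one jackknife for a simple random sample, so that the replicate weight is zero for the deleted unit and $n/(n-1)$ times the original weight otherwise, with $c_k = (n-1)/n$; these choices, the function names and the toy data are assumptions made for illustration and are not the only replication method covered by (13).

```python
def fefi_total(w, y, cell):
    """FEFI total: each nonrespondent j contributes w_j times the weighted
    mean of the respondents in its cell, i.e. fractions w_i / sum_s w_s."""
    total = 0.0
    for j, y_j in enumerate(y):
        if y_j is not None:
            total += w[j] * y_j
            continue
        donors = [i for i, c in enumerate(cell)
                  if c == cell[j] and y[i] is not None]
        denom = sum(w[i] for i in donors)
        if denom <= 0.0:
            raise ValueError("no donor with positive weight in the cell")
        total += w[j] * sum(w[i] * y[i] for i in donors) / denom
    return total


def jackknife_fefi(w, y, cell):
    """Delete-one jackknife: only the weights change between replicates;
    the imputed values themselves are never recomputed."""
    n = len(w)
    full = fefi_total(w, y, cell)
    replicates = []
    for k in range(n):
        w_k = [0.0 if i == k else w[i] * n / (n - 1) for i in range(n)]
        replicates.append(fefi_total(w_k, y, cell))
    c_k = (n - 1) / n
    v_hat = c_k * sum((rep - full) ** 2 for rep in replicates)
    return full, replicates, v_hat


# Toy sample with equal weights (illustrative values only).
w = [1.0] * 8
y = [5.0, None, 7.0, 9.0, None, 4.0, 6.0, None]
cell = [1, 1, 1, 1, 2, 2, 2, 2]
full, replicates, v_hat = jackknife_fefi(w, y, cell)
print(full, v_hat)
```

When unit $k$ is a respondent, its zero replicate weight removes it from both the numerator and the denominator of the donor fractions, which reproduces the ratio form in (15).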

Theorem 3.1 of Kim, Navarro and Fuller (2005) can be 
used to show that, given a consistent full sample replication 
procedure, 

$$\hat{V}_{FE} = V(\tilde{\theta}_{FE,\nu} \mid F_\nu) - N_\nu^{-2} \sum_{g=1}^{G_\nu} \sum_{i \in U_{g\nu}} \pi_{g\nu}^{-1} (1 - \pi_{g\nu}) e_{i\nu}^2 + o_p(n_\nu^{-1}), \qquad (17)$$

where $\tilde{\theta}_{FE,\nu}$ is defined in (10), and the distribution is with 
respect to the sampling and response mechanisms. 
If the finite population correction can be ignored, the 
estimator (16) is consistent for $V\{\hat{\theta}_{FE}\}$. If the sample size is 
large relative to N, then an estimator of 

$$N_\nu^{-2} \sum_{g=1}^{G_\nu} \sum_{i \in U_{g\nu}} \pi_{g\nu}^{-1} (1 - \pi_{g\nu}) e_{i\nu}^2$$

should be added to (16). 

The imputation and variance estimation procedure 
outlined for the response model also produces consistent 
estimators for the cell mean model. Under the cell mean 
model, the elements within a cell of the finite population are 
a realization of independently and identically distributed 
random variables. The imputation procedure based on the 
response model is not necessarily fully efficient for the 
population mean under the cell mean model, but it can be 
shown that the estimator of the mean and the estimator of 
the variance of the estimated mean are consistent. 


4. Approximations to the Fully 
Efficient Procedure 


In the previous sections, the estimator $\hat{\theta}_{FEFI}$ was 
constructed to produce zero imputation variance. The 
implementation of the fractional imputation procedure, as 
described in (11), could require the use of a large number of 
donors for each recipient. Therefore, we outline a procedure 
with a fixed number of donors per recipient that is fully 
efficient for the grand total, but not necessarily fully 
efficient for subpopulations. The procedure assigns donors 
to produce small between-recipient variance of imputed 
values and modifies the weights of donors to attain full 
efficiency for the total. 

Suppose that M donors are to be assigned to each 
recipient. We suggest donors be assigned to recipients to 
approximate the distribution of all respondents in the cell. 
One possible selection method is to select a stratified sample 
for each recipient. A second possibility is to use systematic 
sampling with probability proportional to the weights to 
select donors for each recipient. Initial fractions $w_{ij0}^*$ are 
assigned to the donated values. For systematic sampling 
with equal weights, the initial $w_{ij0}^*$ is $M^{-1}$. 

After the donors are assigned, the initial fractions, $w_{ij0}^*$, 
are adjusted so that the sum of the weights gives the fully 
efficient estimator of the mean of y, and such that the 
estimated cumulative distribution function based on the 
weights approximates the fully efficient estimator of the 
cumulative distribution function. The modification of 
weights using regression has been suggested by Fuller 
(1984, 2003). Chen, Rao and Sitter (2000) discussed an 
efficient imputation method that changes the imputed values 
rather than the weights. Let $\mathbf{z}_{gi} = (z_{gi1}, z_{gi2}, \ldots, z_{giQ})$ be 
a vector defined by 

$$z_{gi1} = y_i,$$

$$z_{gi2} = \begin{cases} 1 & \text{if } y_i \le L_1 \\ 0 & \text{otherwise} \end{cases}$$

$$\vdots$$

$$z_{giQ} = \begin{cases} 1 & \text{if } L_{Q-2} < y_i \le L_{Q-1} \\ 0 & \text{otherwise,} \end{cases}$$

where $L_1, L_2, \ldots, L_{Q-1}$ divide the range of observed y in cell 
g into $Q - 1$ sections. The number of sections that can be 
used depends on the numbers and type of observations in 
the cell, the number of recipients and the number of donors 
per recipient. If the number of donors per recipient is large, 
it is possible to adjust the set of weights for each recipient so 
that the sum of $w_{ij}^*$ over $i$ is one for every $j$ and the sum of 
$w_{ij}^* y_i$ over $i$ is the fully efficient estimator for every $j$. In 
most cases the weights will be adjusted so that the sum of 
the $w_{ij}^*$ over $i$ is one for every $j$ and the cell means of the 
imputed values are equal to the fully efficient estimator. 

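Let $\bar{\mathbf{z}}_{FE,g}$ denote the fully efficient estimator for cell g. 
Using regression procedures, the modified $w_{ij}^*$, modified to 
give the fully efficient cell mean of z, are 

$$w_{ij}^* = w_{ij0}^* + (\bar{\mathbf{z}}_{FE,g} - \bar{\mathbf{z}}_{g \cdot \cdot})\, \mathbf{S}_{zzg}^{-1}\, w_{ij0}^* (\mathbf{z}_{g[i]j} - \bar{\mathbf{z}}_{g \cdot j})', \qquad (18)$$

where 

$$\mathbf{S}_{zzg} = \sum_{j \in A_{Mg}} b_j \sum_{i \in A_{Rg}} w_{ij0}^* (\mathbf{z}_{g[i]j} - \bar{\mathbf{z}}_{g \cdot j})' (\mathbf{z}_{g[i]j} - \bar{\mathbf{z}}_{g \cdot j}) d_{ij},$$

$$\bar{\mathbf{z}}_{g \cdot j} = \sum_{i \in A_{Rg}} w_{ij0}^* \mathbf{z}_{g[i]j} d_{ij},$$

$$\bar{\mathbf{z}}_{g \cdot \cdot} = \sum_{j \in A_{Mg}} b_j \sum_{i \in A_{Rg}} w_{ij0}^* \mathbf{z}_{g[i]j} d_{ij},$$

$$b_j = \Big( \sum_{s \in A_{Mg}} w_s \Big)^{-1} w_j,$$

$A_{Mg}$ is the set of indexes of recipients in cell g, $\mathbf{z}_{g[i]j} = \mathbf{z}_{gi}$ 
is the value imputed from donor $i$ for recipient $j$, and $\bar{\mathbf{z}}_{g \cdot j}$ is 
the weighted mean of the imputed values for recipient $j$ 
using the initial $w_{ij0}^*$. 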

To estimate the variance, replicates are created so that the 
weights on the donors reflect the effect of the deletion of an 
element on the fully efficient estimator. We use the words 
“deletion” and “delete” to identify the element chosen for 
principal weight modification for replication variance 
estimation. 

Let $w_i^{(k)}$ be the weight assigned to element $i$ for the $k$th 
replicate for variance estimation of the full sample estimator. 
Then the replicate for the fully efficient mean of y for 
cell g is 

$$\bar{\mathbf{z}}_g^{(k)} = \Big( \sum_{i \in A_{Rg}} w_i^{(k)} \Big)^{-1} \sum_{i \in A_{Rg}} w_i^{(k)} \mathbf{z}_{gi}. \qquad (19)$$

Replicate fractions are assigned to donors in cell g so that 
the replicate estimate of the cell mean is $\bar{\mathbf{z}}_g^{(k)}$. Initial 
fractional weights $w_{ij0}^{*(k)}$ are assigned where $w_{ij0}^{*(k)}$ is small, 
but positive, if $i$ is a deleted unit for replicate $k$. The final 
fractional weights $w_{ij}^{*(k)}$ are computed using the procedure 
of (18) with $\bar{\mathbf{z}}_g^{(k)}$ replacing $\bar{\mathbf{z}}_{FE,g}$ and $w_{ij0}^{*(k)}$ replacing $w_{ij0}^*$. 
The procedure simulates the effect of deleting a single 
element on the fully efficient estimator. 


5. An Artificial Example 


In this section, we present an example with artificial data 
to illustrate the implementation of the proposed method. 
Suppose that two study variables, x and y, are observed in a 
sample of size n = 10 obtained by simple random sampling. 
Variable x is a categorical variable with three categories, say 
1, 2, and 3, and variable y is a continuous variable. Both 
variables have item nonresponse and there is a set of 
imputation cells for each variable. Table 5.1 shows the 
sample observations, where nonresponse is denoted by M in 
the table. We use a weight of one to simplify the presen- 
tation. Divide by ten to obtain weights for the mean. 


Table 5.1 
An Illustrative Data Set 

Observation  Weight  Cell for x  Cell for y  x  y 
1            1       1           1           1  7 
2            1       1           1           2  M 
3            1       1           2           3  M 
4            1       1           1           M  14 
5            1       1           2           1  3 
6            1       2           1           2  15 
7            1       2           2           3  8 
8            1       2           1           3  9 
9            1       2           2           2  2 
10           1       2           1           M  M 


Because the x variable is a categorical variable with three 
categories, using three fractions for fractional imputation 
gives fully efficient estimators for the distribution of the 
x-variable. Thus the weights in Table 5.2 for the three 
imputed values of x for observation four are the fractions for 
the three categories in x-cell one. 

If a subset of donors is to be used for each recipient, a 
controlled method of selecting donors, such as systematic 
sampling, is suggested. In our simple illustration we could 
easily use fractional imputation with all four y responses in 
cell 1, but to illustrate the regression adjustment we use only 
three. See Table 5.2. 

Several approaches are possible for the situation in which 
two items are missing, including the definition of a third set 
of imputation cells for such cases. Because of the small size 
of our illustration, we impute under the assumption that x 
and y are independent within cells. Thus we impute four 
values for observation ten. For each of the two possible 
values of x we impute two possible values for y. One of the 
pair of imputed y-values is chosen to be less than the mean 
of responses and one is chosen to be greater than the mean. 
See the imputed values for observation 10 in Table 5.2. 


Table 5.2 
Fractional Weights for Means 

Observation  Weight  Donor for y  Cell for x  Cell for y  x  y 
1            1.0000  -            1           1           1  7 
2            0.2886  1            1           1           2  7 
2            0.3960  6            1           1           2  15 
2            0.3154  8            1           1           2  9 
3            0.3333  5            1           2           3  3 
3            0.3333  7            1           2           3  8 
3            0.3334  9            1           2           3  2 
4            0.5000  -            1           1           1  14 
4            0.2500  -            1           1           2  14 
4            0.2500  -            1           1           3  14 
5            1.0000  -            1           2           1  3 
6            1.0000  -            2           1           2  15 
7            1.0000  -            2           2           3  8 
8            1.0000  -            2           1           3  9 
9            1.0000  -            2           2           2  2 
10           0.2247  8            2           1           2  9 
10           0.2753  4            2           1           2  14 
10           0.2095  1            2           1           3  7 
10           0.2905  6            2           1           3  15 


Initial fractions of one third are assigned to the three 
imputed values for observations three and four, and initial 
fractions of one fourth are assigned to the four imputed 
values for observation ten. The fractional weights are then 
adjusted using the regression method of equation (18) to 
give the FEFI mean of y as the estimator, where the fully 
efficient estimator for the mean of y is 


We restrict the weights for observation 10 so that the 
estimated fractions for the two categories of x are the cell 
fractions. Then, because the weighted mean for the categor- 
ical variable is controlled for each individual, the vector z 
contains only the y-variable. Table 5.2 gives the final 
fractional weights computed with the regression weighting. 
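
The regression adjustment of (18) is easy to trace when the vector z contains only y, as in this example. The sketch below applies the scalar version of (18) to y-cell one, assuming the responding values in that cell are 7, 14, 15 and 9 (cell mean 11.25), that observation two receives the values of donors 1, 6 and 8, and that observation ten receives the values of donors 8, 4, 1 and 6. The function name is hypothetical, and the additional restriction on the x-category fractions of observation ten is not imposed, so only the fractions for observation two are compared with Table 5.2.

```python
def regression_adjust(donor_values, initial, b, target):
    """Scalar version of the adjustment in (18).

    donor_values[j] : values imputed to recipient j
    initial[j]      : initial fractions for recipient j (each list sums to 1)
    b[j]            : recipient weight divided by the total recipient weight
    target          : fully efficient cell mean
    """
    # Weighted mean of the imputed values for each recipient.
    means = [sum(f * z for f, z in zip(fr, zs))
             for fr, zs in zip(initial, donor_values)]
    overall = sum(bj * m for bj, m in zip(b, means))
    # Pooled within-recipient sum of squares, the scalar S_zz.
    s_zz = sum(bj * sum(f * (z - m) ** 2 for f, z in zip(fr, zs))
               for bj, fr, zs, m in zip(b, initial, donor_values, means))
    slope = (target - overall) / s_zz
    return [[f * (1.0 + slope * (z - m)) for f, z in zip(fr, zs)]
            for fr, zs, m in zip(initial, donor_values, means)]


# y-cell one of the artificial example (values assumed as stated above).
donor_values = [[7.0, 15.0, 9.0],          # observation two: donors 1, 6, 8
                [9.0, 14.0, 7.0, 15.0]]    # observation ten: donors 8, 4, 1, 6
initial = [[1.0 / 3.0] * 3, [0.25] * 4]
b = [0.5, 0.5]                             # two recipients with equal weights
target = 11.25                             # mean of 7, 14, 15 and 9

final = regression_adjust(donor_values, initial, b, target)
adjusted_mean = sum(bj * sum(f * z for f, z in zip(fr, zs))
                    for bj, fr, zs in zip(b, final, donor_values))
print([round(f, 4) for f in final[0]], round(adjusted_mean, 4))
```

Under these assumptions the fractions for observation two agree with the 0.2886, 0.3960 and 0.3154 of Table 5.2 to within rounding, each set of fractions still sums to one, and the weighted mean of the imputed values equals the fully efficient cell mean.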

An analyst can use the data set of Table 5.2 and any full- 
sample computer program to compute estimates of 
functions of y and x, such as the mean of y for the x cate- 
gories. The fractional data set is fully efficient for any 
function of the x-variable and is also fully efficient for the 
mean of the y-variable. 

For jackknife variance estimation, we repeat the weight 
calculation for each replicate. The replicate estimates of the 
cell means of y are given in Table 5.3 and the replicate 
estimates of the category fractions for x are given in Table 
5.4. The values in Table 5.3 and in Table 5.4 are used as the 
control totals $\bar{\mathbf{z}}_g^{(k)}$ in the regression weighting. We used 
$w_{ij0}^{*(k)} = 3^{-1}$ as the initial value of the replication fractions 
for observation two and $w_{ij0}^{*(k)} = 4^{-1}$ for observation ten. 
Table 5.5 contains the jackknife weights for the 
fractionally imputed data set of Table 5.2. The replicate 
weights are used in the same way as replicates for a full 
sample. They are appropriate, with the caveats of the next 
section, for any statistic for which the full sample jackknife 
is appropriate. Thus the procedure is particularly appealing 
for a general purpose data set, because no additional 
computations are required of the analyst. 

The fully efficient estimator of the mean of y is obtained 
by treating the respondents as the second phase of a two 
phase sample. A two-phase variance estimator is 


where $s_{Rg}^2$ is the within cell sample variance for cell g. If 
we use the replication weights in Table 5.5, the replication 
variance estimate for the mean of y is 

$$\hat{V}_{JK}(\bar{y}_{FE}) = \sum_{k=1}^{10} c_k (\bar{y}_{FE}^{(k)} - \bar{y}_{FE})^2 = 3.078.$$

The difference between the linearized variance estimator 
and the jackknife variance estimator is 


Thus, the jackknife variance estimator slightly overestimates 
the true variance in this example. 




Table 5.3 
Jackknife Replicates of Cell Mean of y-variable 

Cell  Replicate 
      1      2      3      4      5      6      7      8      9      10 
1     12.67  11.25  11.25  10.33  11.25  10.00  11.25  12.00  11.25  11.25 
2     4.33   4.33   4.33   4.33   5.00   4.33   2.50   4.33   5.50   4.33 
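
The entries of Table 5.3 can be reproduced directly. The sketch below assumes the y-values and y-cells of the artificial example are those shown in the code (7, 14, 15, 9 in cell one and 3, 8, 2 in cell two, with observations two, three and ten missing) and uses a delete-one jackknife with equal weights, so that the common replicate weight cancels from each cell mean. The function name is hypothetical.

```python
def delete_one_cell_means(y, cell):
    """Mean of the responding y-values in each cell after deleting unit k,
    for k = 0, ..., n - 1 (equal weights, so the weight cancels)."""
    n = len(y)
    table = {}
    for g in sorted(set(cell)):
        row = []
        for k in range(n):
            kept = [y[i] for i in range(n)
                    if i != k and cell[i] == g and y[i] is not None]
            row.append(sum(kept) / len(kept))
        table[g] = row
    return table


# Assumed data of the artificial example; None marks a missing y.
y = [7.0, None, None, 14.0, 3.0, 15.0, 8.0, 9.0, 2.0, None]
cell = [1, 1, 2, 1, 2, 1, 2, 1, 2, 1]

table = delete_one_cell_means(y, cell)
for g, row in table.items():
    print(g, [round(v, 2) for v in row])
```

Deleting a nonrespondent, or a unit in the other cell, leaves the cell mean unchanged, which accounts for the repeated 11.25 and 4.33 entries.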

Table 5.4 
Jackknife Replicates of Cell Mean of the Dummy Variables of x-variable 

Cell  Level of x  Replicate 
                  1     2     3     4     5     6     7     8     9     10 
1     1           0.33  0.67  0.67  0.50  0.33  0.50  0.50  0.50  0.50  0.50 
1     2           0.33  0.00  0.33  0.25  0.33  0.25  0.25  0.25  0.25  0.25 
1     3           0.33  0.33  0.00  0.25  0.33  0.25  0.25  0.25  0.25  0.25 
2     1           0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 
2     2           0.50  0.50  0.50  0.50  0.50  0.33  0.67  0.67  0.33  0.50 
2     3           0.50  0.50  0.50  0.50  0.50  0.67  0.33  0.33  0.67  0.50 

Table 5.5 
Jackknife Weights for Fractional Imputation 

Obs.  Replicate 
      1       2       3       4       5       6       7       8       9       10 
1     0       1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111 
2     0.1664  0       0.3206  0.4205  0.3206  0.4563  0.3206  0.2392  0.3206  0.2724 
2     0.6559  0       0.4400  0.3002  0.4400  0.2500  0.4400  0.5540  0.4400  0.5075 
2     0.2888  0       0.3505  0.3904  0.3505  0.4048  0.3505  0.3179  0.3505  0.3312 
3     0.3706  0.3706  0       0.3706  0.3226  0.3706  0.5018  0.3706  0.2867  0.3706 
3     0.3697  0.3697  0       0.3697  0.5018  0.3697  0.0090  0.3697  0.6004  0.3697 
3     0.3708  0.3708  0       0.3708  0.2867  0.3708  0.6003  0.3708  0.2240  0.3708 
4     0.3703  0.7407  0.7407  0       0.3703  0.5556  0.5556  0.5556  0.5556  0.5556 
4     0.3704  0       0.3704  0       0.3704  0.2777  0.2777  0.2777  0.2777  0.2777 
4     0.3704  0.3704  0       0       0.3704  0.2778  0.2778  0.2778  0.2778  0.2778 
5     1.1111  1.1111  1.1111  1.1111  0       1.1111  1.1111  1.1111  1.1111  1.1111 
6     1.1111  1.1111  1.1111  1.1111  1.1111  0       1.1111  1.1111  1.1111  1.1111 
7     1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  0       1.1111  1.1111  1.1111 
8     1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  0       1.1111  1.1111 
9     1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  1.1111  0       1.1111 
10    0.1624  0.2777  0.2777  0.3061  0.2777  0.2286  0.3474  0.3013  0.1520  0 
10    0.3931  0.2778  0.2778  0.2494  0.2778  0.1417  0.3934  0.4395  0.2185  0 
10    0.0932  0.2778  0.2778  0.3231  0.2778  0.4400  0.1483  0.0746  0.3171  0 
10    0.4623  0.2778  0.2778  0.2324  0.2778  0.3008  0.2220  0.2957  0.4235  0 


6. Simulation Studies 


6.1 Study Parameters 


To study the properties of the imputation procedure we 
conducted a Monte Carlo study. The sample is a stratified 
sample with two elements per stratum and two imputation 
cells, where the cells cut across the strata. Cell one is 20% 
of the population in strata 1-25 and 80% of the population 
in strata 26-50. The probability of response is 0.7 for cell 
one and 0.5 for cell two. Two variables are considered. The 
variable D is always observed and defines a subpopulation. 
The probability that D = 1 is 0.25 for cell one and 0.40 for 
cell two. The variable y is subject to nonresponse with 
constant within-cell response probabilities. The variable D is 
independent of y and of the response probability. The 
variable y is normally distributed, where the parameters for 
a population of 50 strata are given in Table 6.1. In the data 
generating model of Table 6.1, there are no stratum effects. 
The parameters of interest are: $\theta_1$ = mean of y, $\theta_2$ = mean 
of y for D = 1, $\theta_3$ = fraction of Y's less than two, $\theta_4$ = 
fraction of Y's less than one. 


Table 6.1 
Parameter Set A 

        Element  Cell One        Cell Two 
Strata  Weight   Mean  Variance  Mean  Variance 
1-25    0.01     0.4   0.36      1.6   0.36 
26-50   0.01     0.4   0.36      1.6   0.36 


6.2 Estimation Procedures 


In the simulation, M = 5 and M = 3 donors were used per
recipient. Systematic samples were selected to serve as
donors for each recipient. If the number of respondents in
the cell is less than M, every respondent was used as a donor
for every recipient and the $w_{ij}^*$ are proportional to the
original $w_i$ of the respondents. If there are more than M
respondents in a cell, the donors are ordered by size and
numbered from one to $r_g$. Then the donors are placed in the
order $1, 3, 5, \ldots, r_g, r_g - 1, r_g - 3, \ldots, 2$ for $r_g$ odd and the order
$1, 3, 5, \ldots, r_g - 1, r_g, r_g - 2, \ldots, 2$ for $r_g$ even. The cumulated
sums of the weights are formed and $m_g$ systematic samples
of size M are selected, where $m_g$ is the number of recipients in the cell. The cumulative
sums are normalized so that the grand sum is one, a random
number, $R_g$, between zero and $0.2 m_g^{-1}$ is selected, and the
$m_g$ samples are the systematic samples of size M defined
by the donor associated with $R_g + 0.2(s-1) + (t-1)m_g^{-1}$,
s = 1, 2, 3, 4, 5, for recipients $t = 1, 2, \ldots, m_g$. The initial
imputation fraction for each donor is $w_{ij}^* = M^{-1}$.
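
The donor ordering described above (odd-numbered ranks ascending, followed by even-numbered ranks descending) can be sketched as follows. The function name `donor_order` is hypothetical, and the sketch covers only the ordering step, not the systematic selection itself.

```python
def donor_order(r):
    """Arrange size ranks 1..r as odd ranks ascending, then even ranks descending."""
    odds = list(range(1, r + 1, 2))    # 1, 3, 5, ...
    evens = list(range(2, r + 1, 2))   # 2, 4, 6, ...
    return odds + evens[::-1]          # odds up, then evens down to 2
```

For example, `donor_order(7)` gives `[1, 3, 5, 7, 6, 4, 2]` and `donor_order(6)` gives `[1, 3, 5, 6, 4, 2]`. One plausible reading is that this arrangement keeps neighbouring donors in the cumulated list similar in size, so that each systematic sample spreads over the range of donor values.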

The initial imputation fractions are modified using the
regression procedure of (18). The donors in a cell were
ordered from smallest to largest and the cumulative sum of
the weights formed. Let

    $S_{gt} = \sum_{i=1}^{t} w_{gi}$,    (20)

where $w_{gi}$, $i = 1, 2, \ldots, r_g$, is the weight of $y_{g(i)}$, and
$y_{g(1)} \le \cdots \le y_{g(r_g)}$ are the ordered y-values in cell g. To
define the boundaries of groups to be used to create
indicator functions, let $t_{gs}$ be the largest t for which

    $S_{gt} \le 0.2\,s\,S_{g r_g}$

for s = 1, 2, 3, 4, where $S_{g r_g}$ is the total of the weights of the
donors in cell g. Define

    $z_{gs(i)} = 1$ if $y_{g(i)}$ lies in the group bounded by $y_{g(t_{gs})}$ [exact inequality illegible in the source]
             $= 0$ otherwise    (21)

for s = 1, 2, 3, 4 and let $z = (y, z_1, z_2, z_3, z_4)'$. The
regression modified imputed estimator of the mean for each
of the five variables in the z-vector is the fully efficient
estimator of the respective mean.

The k-deleted FE estimator of the cell mean of z is
defined in (19). The initial fractional weight for donor k to
element j is set at $w_{kj(0)}^{*(k)} = 0.01 w_{kj}^*$. This initial weight
assures that the final weight will be small, but permits
regression adjustment. The final $w_{ij}^{*(k)}$ are computed using
the regression procedure of (18) using the initial weight
$w_{ij(0)}^{*(k)}$.


6.3 Monte Carlo Results 


The Monte Carlo results for 5,000 samples generated by
the parameters of Table 6.1 are given in Table 6.2 and Table
6.3. Results are given for the full sample, for fractional
imputation (FI) with five donors, fractional imputation with three
donors, and for multiple imputation (MI) using the
Approximate Bayesian Bootstrap (ABB) of Rubin and
Schenker (1986) with M = 5 and ABB with M = 3. Both the
FI and MI procedures are unbiased for all four parameters of
Table 6.2. The last column of Table 6.2 gives the Monte
Carlo variance of the estimator divided by the Monte Carlo
variance of the FI procedure with M = 5, expressed in
percent. The FI procedure is 5 to 10 percent more
efficient than MI with M = 5 and 9 to 13 percent more
efficient than MI with M = 3.
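
The standardized variance column can be reproduced from the variance column. The short sketch below does this for the overall mean, using the Monte Carlo variances reported in Table 6.2; the function name `standardized_variance` is hypothetical.

```python
# Monte Carlo variances of the estimators of the overall mean, from Table 6.2.
VARIANCES = {
    "Complete Sample": 0.00570,
    "FI(3)": 0.00849,
    "ABB(3)": 0.00926,
    "FI(5)": 0.00849,
    "ABB(5)": 0.00903,
}


def standardized_variance(variances, reference="FI(5)"):
    """Express each variance as a percentage of the reference procedure's variance."""
    ref = variances[reference]
    return {name: round(100 * v / ref) for name, v in variances.items()}
```

With these inputs the ratios round to 67, 100, 109, 100 and 106, so MI with M = 5 has a variance about 6 percent larger than FI with M = 5 for this parameter.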

Under the model, the mean of the observed values is not 
the best estimator of the domain mean. In this example, the 
FI estimator is about as efficient as the full sample 
estimator. The effect of a smaller number of observations is 
balanced by the use of a superior estimator of the mean for 
the domain. Under the model, the domain indicator is 
independent of the y values, given the cell. Therefore it is 
efficient to use all values in the cell as donors, not just 
respondents in the domain. 

The properties of the variance estimators are given in 
Table 6.3. The column headed “Relative Mean” gives the 
Monte Carlo estimated mean of the estimated variances 
divided by the Monte Carlo estimated variance, where the
Monte Carlo estimated variance is given in Table 6.2. Both
variance estimation procedures appear to be nearly unbiased
for the variance of the mean. The relative variance of the MI
variance estimator for M = 5 is nearly twice that of the FI
variance estimator for M = 5. For M = 3, the relative variance
of the MI variance estimator is more than three times that for FI. The MI
variance estimator has a large variance because the variance
due to missing observations is estimated with four degrees
of freedom for M = 5 and with two degrees of freedom for
M = 3.

The MI variance estimator for the domain mean is 
seriously biased. This property was first identified by Fay
(1991, 1992) and studied by Meng (1994) and Wang and
Robins (1998). The FI variance estimator for the domain 
mean also has a positive bias, though much smaller than that 
of MI. The bias in the FI variance estimator can be reduced 
by increasing M, but the bias of MI has little relationship to 
M. 

All variance estimators for the variance of $\hat{\theta}_4$ are slightly
negatively biased. We believe FI is slightly biased for $\theta_4$
because, although we use the z-vector, the weights are 
slightly smoothed by the regression procedure. MI is known 
to have a small sample bias. See Kim (2002). 


Table 6.2
Mean and Variance of the Point Estimators Under Setup A (5,000 Samples of Size 100)

Parameter                   Imputation Scheme   Mean   Variance   Stand. Var.
Mean ($\theta_1$)           Complete Sample     1.00   0.00570     67
                            FI(3)               1.00   0.00849    100
                            ABB(3)              1.00   0.00926    109
                            FI(5)               1.00   0.00849    100
                            ABB(5)              1.00   0.00903    106
Domain Mean ($\theta_2$)    Complete Sample     1.14   0.02020     99
                            FI(3)               1.14   0.02050    100
                            ABB(3)              1.14   0.02230    109
                            FI(5)               1.14   0.02040    100
                            ABB(5)              1.14   0.02170    106
Pr(y < 2) ($\theta_3$)      Complete Sample     0.87   0.00104     51
                            FI(3)               0.87   0.00202    100
                            ABB(3)              0.87   0.00228    113
                            FI(5)               0.87   0.00202    100
                            ABB(5)              0.87   0.00223    110
Pr(y < 1) ($\theta_4$)      Complete Sample     0.50   0.00208     66
                            FI(3)               0.50   0.00313    100
                            ABB(3)              0.50   0.00342    109
                            FI(5)               0.50   0.00313    100
                            ABB(5)              0.50   0.00329    105


Table 6.3
Relative Mean, t-statistic and Relative Variance for the Variance Estimators Under Setup A
(5,000 Samples of Size 100)

Parameter                   Method    Relative Mean (%)**   t-statistic*   Relative Variance (%)
Mean ($\theta_1$)           FI(3)     100.1                  0.05           5.66
                            ABB(3)     99.6                 -0.19          19.25
                            FI(5)     100.1                  0.03           5.65
                            ABB(5)     98.2                 -0.89           9.95
Domain Mean ($\theta_2$)    FI(3)     115.9                  7.54          13.88
                            ABB(3)    127.9                 [illegible]    28.88
                            FI(5)     106.6                  3.14          11.62
                            ABB(5)    128.4                 13.43          20.03
Pr(y < 2) ($\theta_3$)      FI(3)     103.9                  1.86          13.90
                            ABB(3)    100.8                  0.36          48.42
                            FI(5)     101.7                  0.82          12.07
                            ABB(5)     98.5                 -0.67          25.10
Pr(y < 1) ($\theta_4$)      FI(3)      98.5                 -0.75           4.67
                            ABB(3)     96.3                 -1.80          18.51
                            FI(5)      97.6                 -1.20           4.45
                            ABB(5)     96.7                 -1.65          10.17

* Statistic for hypothesis that the estimated variance is unbiased.
** Monte Carlo mean of variance estimates divided by Monte Carlo variance of estimates, in percent.

In a second set of parameters, denoted by C, the means
were as follows:

Cell 1 of strata 1-25:   $\mu$ = 0.4
Cell 1 of strata 26-50:  $\mu$ = 3.0
Cell 2 of strata 1-25:   $\mu$ = 1.6
Cell 2 of strata 26-50:  $\mu$ = 2.2

All other parameters are the same as in parameter set A. The
properties of the estimators are given in Table 6.4. Both FI
and MI produce unbiased estimates of the means and of the
domain mean. As with parameter set A, the FI procedure is
more efficient than MI: 8 to 12 percent more efficient for M = 5 and
14 to 16 percent more efficient for M = 3.

The assumptions required for MI variance estimation are 
not satisfied for parameter set C. Therefore the MI estimated 
variance is seriously biased for all parameters. See Table
6.5. The bias in the MI estimated variance with M = 5 is
about 17% for the variance of the overall mean and nearly
50% for the domain mean. The bias of the MI variance of
the mean for a binomial variable is smaller than the bias for
the mean of the continuous variable because the stratification
effect is smaller for the binomial variable.

The properties of the estimated variances for the FI 
procedures are similar to those for setup A. There is a 
positive bias for the variance of the domain mean of about 
23% for M = 3 and about 6% for M = 5.

The variance of the MI estimated variance is 2.4 to 3.5
times the variance of the FI estimated variance for M = 5
and 3 to 7 times for M = 3, demonstrating the clear superiority
of the FI variance estimator for this configuration.


Table 6.4
Mean and Variance of the Point Estimators Under Setup C (5,000 Samples of Size 100)

Parameter                   Imputation Scheme   Mean   Variance   Stand. Variance
Mean ($\theta_1$)           Complete Sample     2.10   0.00500     48
                            FI(3)               2.10   0.01050    100
                            ABB(3)              2.10   0.01220    116
                            FI(5)               2.10   0.01050    100
                            ABB(5)              2.10   0.01150    110
Domain Mean ($\theta_2$)    Complete Sample            0.02530    102
                            FI(3)               2.01   0.02510    101
                            ABB(3)              2.01   0.02850    115
                            FI(5)               2.01   0.02480    100
                            ABB(5)              2.01   0.02710    109
Pr(y < 2) ($\theta_3$)      Complete Sample            0.00127     45
                            FI(3)               0.45   0.00281    100
                            ABB(3)              0.45   0.00322    115
                            FI(5)               0.45   0.00280    100
                            ABB(5)              0.45   0.00314    112
Pr(y < 1) ($\theta_4$)      Complete Sample            0.00107     54
                            FI(3)               0.15   0.00199    100
                            ABB(3)              0.15   0.00226    114
                            FI(5)               0.15   0.00199    100
                            ABB(5)              0.15   0.00214    108

Table 6.5
Relative Mean, t-statistic and Relative Variance for the Variance Estimators Under Setup C (5,000 Samples of Size 100)

Parameter                   Method    Relative Mean (%)   t-statistic*   Relative Variance (%)
Mean ($\theta_1$)           FI(3)     100.9                0.41           6.42
                            ABB(3)    116.7               [illegible]    40.14
                            FI(5)     100.8                0.39           6.42
                            ABB(5)    117.1                7.99          22.29
Domain Mean ($\theta_2$)    FI(3)     [illegible]         10.78          16.23
                            ABB(3)    144.4               19.79          46.05
                            FI(5)     106.1                2.95          11.95
                            ABB(5)    148.7               [illegible]    32.49
Pr(y < 2) ($\theta_3$)      FI(3)     104.4                2.18           6.63
                            ABB(3)    114.7                6.54          [illegible]
                            FI(5)     101.8                0.89           6.42
                            ABB(5)    [illegible]          5.74          20.67
Pr(y < 1) ($\theta_4$)      FI(3)     102.3               [illegible]    11.08
                            ABB(3)    101.3                0.58          39.14
                            FI(5)      99.9               -0.04          10.05
                            ABB(5)    102.2                1.04          23.60

* Statistic for hypothesis that the estimated variance is unbiased.

7. Summary


In fractional imputation, several donors are used for each 
missing value and each donor is given a fraction of the 
weight of the nonrespondent. If all donors are used, the 
procedure is fully efficient, under the model, for all 
functions of a y-vector. It is shown that the use of fractional
imputation with a small number of imputations per
nonrespondent can give a fully efficient estimator of the mean.
Estimates of other parameters, such as estimates of the
cumulative distribution, are nearly fully efficient.

Fractional imputation permits the construction of general 
purpose replicates for variance estimation. A single set of 
replicates can be used for variance estimation for imputed 
variables, variables observed on all respondents, and under 
model assumptions, for functions of the two types of 
variables. The replicates give estimates of the variances of 
domain means with much smaller biases than those of 
multiple imputation. The bias goes to zero as M increases 
and, in the simulation, is modest for M = 5. The replication 
variance estimator is easily implemented with replication 
software such as WesVar.

Fractional imputation with a fixed number of donors per 
recipient is slightly more efficient for the mean than 
multiple imputation with the same number of donors. 
Fractional imputation gives variance estimates with smaller 
bias and much smaller variance than multiple imputation 
estimators with the same number of imputations. 

Variance Estimation with Hot Deck Imputation:
A Simulation Study of Three Methods 


J. Michael Brick, Michael E. Jones, Graham Kalton and Richard Valliant ¹


Abstract 


Complete data methods for estimating the variances of survey estimates are biased when some data are imputed. This paper 
uses simulation to compare the performance of the model-assisted, the adjusted jackknife, and the multiple imputation 
methods for estimating the variance of a total when missing items have been imputed using hot deck imputation. The 
simulation studies the properties of the variance estimates for imputed estimates of totals for the full population and for 
domains from a single-stage disproportionate stratified sample design when underlying assumptions, such as unbiasedness 
of the point estimate and item responses being randomly missing within hot deck cells, do not hold. The variance estimators 
for full population estimates produce confidence intervals with coverage rates near the nominal level even under modest 
departures from the assumptions, but this finding does not apply for the domain estimates. Coverage is most sensitive to bias 
in the point estimates. As the simulation demonstrates, even if an imputation method gives almost unbiased estimates for the
full population, estimates for domains may be very biased.


Key Words: Adjusted jackknife; Domain estimation; Model-assisted variance estimation; Multiple imputation;
Nonresponse.


1. Introduction 


Imputation is frequently used in survey research to assign 
values for missing item responses, thereby producing 
complete data sets for public use or general analysis. It is 
well-recognized that treating imputed values as observed 
values results in downwardly biased variance estimates for 
the survey estimates. As a result, confidence intervals have
coverage rates lower than their nominal levels. The biases in the variance
estimates tend to increase with the item nonresponse rate 
and can be substantial when that rate is high. 

Three methods of variance estimation that have been
developed for use with imputed data are studied here: a
model-assisted method (Särndal 1992), an adjusted jackknife
method (Rao and Shao 1992), and multiple imputation
(Rubin 1987). Each method has been evaluated theoretically
and by simulation methods, primarily under conditions
consistent with the assumptions of the methods. This paper
uses simulation to compare the three methods under the
same experimental conditions in which some of the assumptions
required by the methods do not hold. The goal is to
examine the relative performances of the methods in
situations that are likely to occur in practice. Other simulation
studies of variance estimation methods with imputed
data have generally been more limited. Even the more
extensive simulation study by Lee, Rancourt, and Särndal
(2001) was based on small populations and it did not
include multiple imputation.


A single-stage disproportionate stratified sample selected
from a real population data set is used to evaluate these
variance estimation methods in a realistic setting. The
imputed values are assigned using a hot deck imputation
method, one of the most popular methods of imputation in
survey research. Since hot deck imputation is a form of
regression imputation (Kalton and Kasprzyk 1986),
restricting the simulation study to the hot deck is not a crucial
feature for examining the implications for variance estimation.
We study estimation for both full population and
domain totals. For the domain estimates, the domain indicator
is assumed to be known for all sample members.

Three different combinations of missing data mechanisms
and hot deck cell formation are used in the simulations
to assess the performance of the variance estimation
methods under conditions that violate the assumptions of the
methods to varying degrees. The three variance estimation 
methods we study all assume that data are randomly missing 
in each hot deck cell and the model-assisted (MA) and 
multiple imputation (MI) methods also assume that a simple 
model with common mean and variance holds in each cell. 
Studying the robustness of the variance estimation methods 
is an important feature of the simulation because in practice 
the assumptions underlying the methods will almost never 
be fully satisfied. 

The next section briefly describes three variance estimation
methods with hot deck imputed data. The third section
outlines the study population, the sample design used in the
simulations, and the methods used to generate the missing
data and implement the hot deck imputations. The fourth
section gives the results of the simulations. The last section
gives some conclusions about the methods and their applicability.


2. Description of the Variance 
Estimation Methods 


We denote the full sample by A, the subset that responds
to an item by $A_R$, and the subset that does not respond by
$A_M$. For the imputations the units are divided into hot deck
cells indexed by $g = 1, \ldots, G$, where the subset of $n_{Rg}$
respondents in cell g is $A_{Rg}$, and the subset of
nonrespondents is $A_{Mg}$. For each unit with a missing value, the
hot deck method consists of randomly selecting a
respondent from within the same hot deck cell to be the
donor of the imputed value.

With hot deck imputation, donors are often selected 
within a cell by simple random sampling with replacement 
(srswr), by simple random sampling without replacement, or 
by sampling with probabilities proportional to the survey 
weights with replacement (ppswr). Since the simulation 
results obtained using the srswr and the ppswr methods are 
very similar, only the results for the ppswr method, termed
the weighted hot deck, are presented here. The imputed
estimator of a population total is
$\hat{\theta}_I = \sum_{i \in A_R} w_i y_i + \sum_{i \in A_M} w_i y_i^*$,
where $w_i$ is the survey weight, $y_i$ is the
reported value and $y_i^*$ is the imputed value for unit i in the
nonrespondent set.
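
A minimal sketch of the weighted hot deck and the imputed estimator of a total is given below. It is an illustration under stated assumptions rather than the authors' implementation: the record layout (dictionaries with keys `w`, `cell` and `y`, with `y` set to `None` when missing) and the function names are hypothetical, and every cell with a missing value is assumed to contain at least one respondent.

```python
import random


def weighted_hot_deck(units, rng=None):
    """Impute each missing y from a donor in the same cell, drawn ppswr."""
    rng = rng or random.Random()
    # Group the respondents (potential donors) by hot deck cell.
    donors = {}
    for u in units:
        if u["y"] is not None:
            donors.setdefault(u["cell"], []).append(u)
    completed = []
    for u in units:
        if u["y"] is None:
            pool = donors[u["cell"]]  # assumes the cell has a respondent
            # Draw one donor with probability proportional to its survey weight.
            donor = rng.choices(pool, weights=[d["w"] for d in pool], k=1)[0]
            completed.append({**u, "y": donor["y"], "imputed": True})
        else:
            completed.append({**u, "imputed": False})
    return completed


def imputed_total(completed):
    """Weighted sum over reported and imputed values."""
    return sum(u["w"] * u["y"] for u in completed)
```

Because each nonrespondent keeps its own weight and only borrows the donor's value, the estimator is the sum of the respondent term and the imputed term in the formula above.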


2.1 Model-Assisted Variance Estimation


The model-assisted (MA) approach with hot deck 
imputation assumes that data are randomly missing within 
the hot deck cells and that a model for the generation of the 
y's holds. A natural model for use with hot deck imputation
is that the $y_i$'s are independently and identically generated
within the hot deck cells, i.e., $y_i \sim (\mu_g, \sigma_g^2)$ for cell g.
Inferences from the model-assisted approach depend on the 
validity of the model assumptions. 

Särndal (1992) decomposed the total variance of the
imputed estimator into three components denoted by
$V_{SAM}$, $V_{IMP}$, and $V_{MIX}$. The estimators used for these components
in the simulations are those given in Brick, Kalton,
and Kim (2004). The MA variance estimator is the sum of
the component estimates: $\hat{V}_{MA} = \hat{V}_{SAM} + \hat{V}_{IMP} + 2\hat{V}_{MIX}$. The
$\hat{V}_{IMP}$ and $\hat{V}_{MIX}$ estimators require an estimator of the
element variance in each hot deck cell. Since the simulations
showed little difference between weighted and unweighted
estimators, the weighted estimator of $\sigma_g^2$ is used:

    $\hat{\sigma}_g^2 = \Big(\sum_{i \in A_{Rg}} w_i\Big)^{-1} \sum_{i \in A_{Rg}} w_i (y_i - \bar{y}_{Rg})^2$,

where $\bar{y}_{Rg} = \sum_{i \in A_{Rg}} w_i y_i \Big(\sum_{i \in A_{Rg}} w_i\Big)^{-1}$.

2.2 Adjusted Jackknife Variance Estimation


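The Rao and Shao (1992) adjusted jackknife (AJ)
variance estimator for a stratified sample with imputations
and ignorable finite population correction factors (fpc's) is

    $v_{AJ} = \sum_h \frac{n_h - 1}{n_h} \sum_{k=1}^{n_h} \big(\hat{\theta}_I^{(hk)} - \hat{\theta}_I\big)^2$,

where $n_h$ is the number sampled in stratum h,

    $\hat{\theta}_I^{(hk)} = \sum_{g=1}^{G} \Big[ \sum_{(hi) \in A_{Rg}} w_{hi}^{(hk)} y_{hi} + \sum_{(hj) \in A_{Mg}} w_{hj}^{(hk)} \big( y_{hj}^* + \bar{y}_{Rg}^{(hk)} - \bar{y}_{Rg} \big) \Big]$

is the adjusted estimator when unit k is omitted,

    $\bar{y}_{Rg}^{(hk)} = \sum_{(hi) \in A_{Rg}} w_{hi}^{(hk)} y_{hi} \Big/ \sum_{(hi) \in A_{Rg}} w_{hi}^{(hk)}$,

    $\bar{y}_{Rg} = \sum_{(hi) \in A_{Rg}} w_{hi} y_{hi} \Big/ \sum_{(hi) \in A_{Rg}} w_{hi}$,

and $w_{hi}^{(hk)}$ is the weight for unit hi adjusted to account for the
deletion of unit k. The notation $(hi) \in B$ denotes that unit i in
stratum h is part of set B. This procedure requires the
computation of $\sum_h n_h$ replicate estimates, $\hat{\theta}_I^{(hk)}$. A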
commonly used strategy to reduce the computations is to
combine units into variance strata (e.g., see Rust and Rao
1996). Let $h^*$ denote a combined variance stratum and k a
group of sample units within the combined stratum. All
sampled units are assigned to one of the groups. Then, the
grouped adjusted jackknife variance estimator is

    $v_{AJ,G} = \sum_{h^*} \sum_{k} \frac{n_{h^*(k)}}{n_{h^*}} \big(\hat{\theta}_I^{(h^*k)} - \hat{\theta}_I\big)^2$,

where $n_{h^*}$ is the number of sample units in combined
variance stratum $h^*$, $n_{h^*(k)}$ is the number of units retained in
stratum $h^*$ when units in group k are deleted and,
corresponding to $\hat{\theta}_I^{(hk)}$, $\hat{\theta}_I^{(h^*k)}$ is the adjusted imputed estimate
for the full population when units in group k in stratum $h^*$
are deleted. The retained units from design stratum h that
are in combined variance stratum $h^*$ are assigned replicate
weights $w_{hi}^{(h^*k)}$ [expression illegible in the source].

The AJ method assumes a uniform response probability
model within each hot deck cell but, unlike the MA method, 
it does not require distributional assumptions. Under the 
uniform response probability model without distributional 
assumptions, a weighted hot deck is needed to produce 
unbiased imputed estimates. 
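
To make the adjustment concrete, the sketch below implements the delete-one adjusted jackknife for the simplified case of a single stratum with the fpc ignored. It is an assumption-laden illustration, not the code used in the simulations: the record layout (keys `w`, `cell`, `y`, `imputed`) and the function name are hypothetical, and each cell is assumed to contain at least two respondents so that every replicate respondent mean is defined.

```python
def adjusted_jackknife_variance(units):
    """Delete-one adjusted jackknife for an imputed total, single stratum.

    units: list of dicts with keys 'w' (weight), 'cell', 'y' (reported or
    imputed value) and 'imputed' (True if y was imputed).
    """
    n = len(units)

    def respondent_means(weights):
        # Weighted respondent mean in each hot deck cell.
        num, den = {}, {}
        for u, w in zip(units, weights):
            if not u["imputed"]:
                num[u["cell"]] = num.get(u["cell"], 0.0) + w * u["y"]
                den[u["cell"]] = den.get(u["cell"], 0.0) + w
        return {g: num[g] / den[g] for g in num}

    full_w = [u["w"] for u in units]
    full_means = respondent_means(full_w)
    theta = sum(w * u["y"] for u, w in zip(units, full_w))

    total = 0.0
    for k in range(n):
        # Replicate weights: drop unit k and inflate the others by n / (n - 1).
        rep_w = [0.0 if i == k else w * n / (n - 1) for i, w in enumerate(full_w)]
        rep_means = respondent_means(rep_w)
        theta_k = 0.0
        for u, w in zip(units, rep_w):
            y = u["y"]
            if u["imputed"]:
                # Shift the imputed value by the change in the respondent cell mean.
                y += rep_means[u["cell"]] - full_means[u["cell"]]
            theta_k += w * y
        total += (theta_k - theta) ** 2
    return (n - 1) / n * total
```

With complete response and equal weights N/n the function reduces to the usual jackknife variance of an estimated total, N²s²/n, which gives a quick sanity check.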

In developing the theory for the AJ method, Rao and
Shao (1992) assume that fpc's are ignorable. However, the
fpc's are not negligible in some strata in the simulations,
with sampling rates ranging from about 0.05 to 0.24. Shao and Steel (1999) and
Lee, Rancourt, and Särndal (1995) provide methods for
accounting for nonnegligible fpc's. The Lee, Rancourt, and
Särndal (1995) fpc adjustment was applied in the simulations
because of its ease of implementation. Without the
fpc adjustment, the AJ variance estimator substantially
overestimated the variances in the simulations.


2.3 Multiple Imputation 


Multiple imputation (MI) is described in detail in Rubin 
(1987) and Little and Rubin (2002). The summary here 
relates to its application with hot deck imputation. As with
the model-assisted approach, within the hot deck cells
responses are assumed to be missing randomly and the y's are
assumed to be independent random variables with a common
mean and variance. For each unit that has a missing
value, M values are imputed, creating M completed data
sets.

To avoid underestimation of variances with the MI 
method, the hot deck method needs to be modified. Rubin 
and Schenker (1986) proposed the approximate Bayesian 
bootstrap (ABB) for simple random sampling with hot deck 
imputation for use with the MI method. The ABB was 
modified for the simulations to accommodate sampling 
donors by ppswr. In the simulations a donor pool for the 
ABB was created in each cell by selecting respondents with 
replacement with probabilities proportional to $w_i$. (There is
no literature that discusses the application of ABB methods
with unequal weights. In hindsight, an unweighted ABB
might have been preferable. The use of an unweighted ABB
with a ppswr hot deck yields unbiased point estimates of
population totals under the response probability model.)
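
The two-step draw of the weighted ABB within one hot deck cell can be sketched as follows. This is a hedged illustration: the function name and arguments are hypothetical, and drawing a fresh donor pool for each of the M data sets follows the usual ABB construction rather than a detail stated in the text.

```python
import random


def weighted_abb_imputations(values, weights, n_missing, M=5, seed=None):
    """Return M lists of imputed values for the n_missing units of one cell."""
    rng = random.Random(seed)
    r = len(values)
    respondents = list(zip(values, weights))
    data_sets = []
    for _ in range(M):
        # Step 1 (ABB): resample r respondents with replacement, with
        # probability proportional to weight, to form this data set's donor pool.
        pool = rng.choices(respondents, weights=weights, k=r)
        # Step 2 (weighted hot deck): draw one donor per missing value from
        # the pool, again with probability proportional to the donor's weight.
        pool_weights = [w for _, w in pool]
        donors = rng.choices(pool, weights=pool_weights, k=n_missing)
        data_sets.append([y for y, _ in donors])
    return data_sets
```

The extra resampling in step 1 is what adds between-imputation variability; without it, the M data sets would all draw from the same fixed set of respondents.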


3. Design of the Simulation Study 


3.1 Description of the Study Population and Sample 
Design 


The sampling frame for the simulations is a subset of the 
file of public school districts extracted from the 1999-2000 
Common Core of Data (CCD) compiled by the U.S.
National Center for Education Statistics. The final frame
consists of 11,941 districts.

The sample design used in the simulations is a stratified
simple random sample of 1,020 school districts. Twelve 
strata were created by cross-classifying four categories of 
number of students (district size) by three categories of the 
percentage of students at or below the poverty level (poverty 
status). The strata and number of districts in the frame are 
given in Table 1. The table also gives the stratum sample 
sizes and sampling rates used in the simulations. 

The table also contains the stratum means and standard 
deviations for the two study variables, the number of 
students in the district and the number of districts that 
include pre-kindergarten as the lowest grade. These study 
variables were chosen because they are typical of many 
estimates computed from this type of design. 

In addition to the full population estimates we computed 
the two study estimates for two domains, defined as districts 
located in the Northeast region and those in nonmetropolitan 
areas. The means for these domains are substantially 
different from the full population means for both study 
variables. 


3.2 Missing Data Mechanisms and Imputation 
Methods 


By construction, information on the two study variables 
is available for all districts in the sampling frame. To create 
missing values, response indicators were assigned to 
sampled units within “response cells”. In some cases the
response cells are the sampling strata, termed STR cells, 
whereas in other cases they are what are termed HD cells. 
The HD cells were defined by the cross-classification of 
four geographic regions and a fourfold categorization of the 
number of full time equivalent teachers in the district. The 
HD cells are somewhat correlated with the sampling strata, 
but each cell contains units from more than one stratum. 


Table 1
Stratum Definitions, Population Counts, Sample Sizes, Sampling Rates, Means and Standard Deviations of Number of Students
and Proportions of Districts with Pre-Kindergarten

          District   Poverty                      Sampling    Number of students         Proportion with
Stratum   size       status     $N_h$    $n_h$    rate        Mean          Std. dev.    pre-kindergarten
1         1          1          615      32       0.0520      270.0         155.0        0.44
2         1          2          1,147    59       0.0514      263.3         175.0        0.49
3         1          3          1,292    66       0.0511      243.5         142.5        0.49
4         2          1          1,720    111      0.0645      1,607.2       837.0        0.44
5         2          2          2,305    149      0.0646      1,429.7       784.1        0.52
6         2          3          1,893    122      0.0644      1,427.8       788.8        0.63
7         3          1          692      75       0.1084      4,695.3       1,360.6      0.35
8         3          2          579      63       0.1088      4,728.5       1,365.0      0.51
9         3          3          527      57       0.1082      4,591.8       1,380.3      0.63
10        4          1          342      83       0.2427      16,003.4      12,670.2     0.51
11        4          2          449      110      0.2450      [illegible]   14,246.7     0.58
12        4          3          380      93       0.2447      19,331.8      16,142.7     0.68
Total                           11,941   1,020    0.0854      [illegible]   6,770.5      0.52

Within a given response cell, sampled units were
assigned at random to be missing or nonmissing at a 
specified rate. For each type of response cell, three schemes 
for assigning rates of missingness were chosen. In two of 
the schemes, the rates of missingness varied across the 
response cells, whereas in the other scheme the rate was 
constant across the cells. 

The simulations were conducted by first drawing a 
stratified simple random sample using the stratum sample 
sizes in Table 1. Once the sample was selected, response 
status (respondent/nonrespondent) was randomly assigned 
to each sampled unit according to the given response 
scheme. For the MA and AJ methods, the weighted hot deck 
imputation procedures described earlier were used to impute 
for missing values. For the MI method, a donor pool was 
first created using the weighted ABB, and weighted hot 
decks were then used to impute for each of the M=5 
imputed data sets. The estimated total numbers of students 
and districts with pre-kindergarten were computed for the 
simulated sample with imputed values, and variance 
estimates were computed for these estimates using the three 
variance estimation methods. (If the estimated variance
could not be computed in a particular simulation run or the
sample size in a cell was less than 2, then that sample was
deleted. The maximum number of deleted samples across all
the simulations of 10,000 runs each was 2 for the MA
method and 28 for the AJ method; only one simulation had 28 AJ samples
deleted, and the next largest number was 3.) The AJ method was
based on three combined variance strata and 40 groups of
units per stratum for a total of 120 replicates. The three
combined strata, formed from strata having about the same
fpc, consisted of strata 1-6, 7-9, and 10-12. As a check of
the grouping, we verified that the grouped jackknife
variance procedure gave essentially the same average 
variance estimates and confidence interval coverage rates as 
the ungrouped jackknife in the case of complete response. 
The entire process was repeated 10,000 times for each 
response scheme. 

A feature of the design of the simulation is that the means 
for the two domains considered often differ substantially 
from the full population means by strata and HD cells. A 
key point for the domain estimates is that imputations were 
made by selecting donors from all the respondents in a hot 
deck cell, without specifically recognizing the domain as 
might be done in practice for some domains. After imputations
were made for the full sample, the total for
a domain was estimated by
$\hat{\theta}_I = \sum_{i \in A_R} \delta_i w_i y_i + \sum_{i \in A_M} \delta_i w_i y_i^*$,
where $\delta_i = 1$ if unit i is in the domain and 0 if not.

Three of the four possible combinations of response 
mechanism (STR or HD cells) and hot deck cell formation 
(STR or HD cells) were studied in the simulations. We refer 
to these combinations as STR/STR, HD/HD, and STR/HD, 
where the first set of letters identifies the response
mechanism and the second set identifies the type of hot deck 
cell. The three sets of response rates were 0.2 to 0.6 spaced 
evenly across the response cells, a constant 0.7 in all cells, 
and 0.6 to 0.9 spread evenly across the cells. The three 
combinations of response/hot deck cells with the three sets 
of response rates generated nine separate simulation 
schemes for each estimate. 
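
The nine schemes can be enumerated as below. The sketch is illustrative: the names are hypothetical, the count of 16 HD cells assumes that every combination of the four regions and four teacher categories is non-empty, and the order in which the evenly spaced rates are attached to cells is an assumption, since the text does not state it.

```python
from itertools import product

CELL_COMBINATIONS = ["STR/STR", "HD/HD", "STR/HD"]   # response cells / hot deck cells
RATE_SETS = {"0.2 to 0.6": (0.2, 0.6), "constant 0.7": (0.7, 0.7), "0.6 to 0.9": (0.6, 0.9)}
N_CELLS = {"STR": 12, "HD": 16}   # 16 HD cells is an assumption (4 regions x 4 categories)


def evenly_spaced(low, high, n_cells):
    """Response rates spaced evenly from low to high across n_cells cells."""
    step = (high - low) / (n_cells - 1)
    return [low + i * step for i in range(n_cells)]


def simulation_schemes():
    """List the (cell combination, rate set, per-cell response rates) schemes."""
    schemes = []
    for combo, (label, (low, high)) in product(CELL_COMBINATIONS, RATE_SETS.items()):
        response_cell = combo.split("/")[0]   # rates vary over the response cells
        rates = evenly_spaced(low, high, N_CELLS[response_cell])
        schemes.append((combo, label, rates))
    return schemes
```

Each scheme pairs one response/hot deck cell combination with one set of response rates, which gives the 3 x 3 = 9 schemes run for each estimate.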


3.3 Assumptions for Models of Response and 
Population Structure 


There are two models involved in the simulations. The 
population model assumes that the y values within each hot 
deck cell are independent and have the same expected value. 
The response model assumes that there is a uniform 
response probability within each hot deck cell. If both 
models hold, then the use of either an unweighted or a 
weighted hot deck will lead to an unbiased estimate of the 
overall population total. However, if only the response 
model is assumed, then the use of a weighted hot deck is 
needed to produce an unbiased estimate of the overall 
population total. Since the weighted hot deck is used in the 
simulations, only the response probability model needs to be 
satisfied for unbiased point estimation of the overall population
total. The response probability model holds for all the
STR/STR and HD/HD combinations and for the STR/HD 
combination with a constant response rate; however, it does 
not hold for the other two STR/HD combinations. The AJ 
theory for variance estimation of population totals was 
developed assuming only the response probability model. 
The MA and MI theories assume that both models hold. 

Reliance on only the response probability model and the 
weighted hot deck to produce unbiased estimates of 
population totals does not in general extend to estimates of 
domain totals. When domains cut across hot deck cells, it is 
necessary to invoke a population model that assumes that 
the expected value of the domain values is the same as that 
of the nondomain values in each hot deck cell. However, if 
the hot deck cells are defined such that each domain 
comprises the full population in a subset of the hot deck 
cells, then the situation for point and variance estimation is 
the same as stated above for overall population totals. 

The simulation schemes were generally constructed so 
that the hot deck cells do not incorporate the domains in 
order to reflect the practical consideration that it is 
essentially impossible to incorporate all domains in an 
imputation scheme. Specifically, in the simulations the 
districts in the Northeast (NE) region and districts in 
nonmetropolitan statistical areas (NMSA) are unrelated to 
the stratum definitions in Table 1 (which are used as hot 
deck cells in some cases). Also, districts in the NMSA 
domain can be found in all HD cells. However, the NE 
domain is a subset of four of the HD cells. Thus, the
definition of the HD cells is more consistent with estimating 
NE domain totals than NMSA domain totals. 


3.4 Summary Statistics 


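The relative bias of a point estimate is estimated by
relbias($\hat{\theta}_I$) = bias($\hat{\theta}_I$)/$\theta$, where
bias($\hat{\theta}_I$) = $\sum_s (\hat{\theta}_{Is} - \theta)/10{,}000$,
$\hat{\theta}_{Is}$ is the estimate from sample s, and $\theta$ is the
finite population parameter. The empirical variance of $\hat{\theta}_I$
is Var($\hat{\theta}_I$) = $\sum_s (\hat{\theta}_{Is} - \bar{\theta}_I)^2/10{,}000$,
where $\bar{\theta}_I = \sum_s \hat{\theta}_{Is}/10{,}000$. The average variance estimate for a particular
method is $\bar{v} = \sum_s v_s/10{,}000$, where $v_s$ is the estimated
variance for simulation run s.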
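
These summary statistics are simple averages over the simulated samples, as the sketch below shows. The function name and the returned keys are hypothetical, and the coverage calculation uses the fixed multiplier 1.96, which corresponds to the MA and AJ intervals only.

```python
def summarize(estimates, variance_estimates, theta, multiplier=1.96):
    """Relative bias, empirical variance, average variance estimate and coverage."""
    n_sims = len(estimates)
    mean_est = sum(estimates) / n_sims
    bias = sum(e - theta for e in estimates) / n_sims
    # Empirical variance around the Monte Carlo mean, with divisor n_sims.
    emp_var = sum((e - mean_est) ** 2 for e in estimates) / n_sims
    avg_var = sum(variance_estimates) / n_sims
    # Share of nominal intervals estimate +/- multiplier * sqrt(v) that contain theta.
    covered = sum(
        abs(e - theta) <= multiplier * v ** 0.5
        for e, v in zip(estimates, variance_estimates)
    )
    return {
        "relbias": bias / theta,
        "empirical_variance": emp_var,
        "average_variance_estimate": avg_var,
        "coverage": covered / n_sims,
    }
```

The ratio of `average_variance_estimate` to `empirical_variance`, expressed in percent, is the quantity used to judge whether a variance estimator is biased.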

The percentages of intervals that include $\theta_N$ are based 
on the nominal 95 percent confidence intervals 
$(\hat{\theta}_I \pm t\,v^{1/2})$ computed for each of the 10,000 simulations 
for each simulation scheme. An issue to consider here is the 
precision of the variance estimates from a disproportionate 
stratified sample design and its impact on whether normal 
approximation or $t$ intervals should be used to calculate 
confidence intervals. We found that the use of the 
$t$-distribution did not have a substantial effect for most 
cases with the MA and AJ methods, and we have therefore 
used a multiplier of 1.96 for confidence intervals based on 
these methods. Rubin and Schenker (1986) suggest using a 
$t$-distribution with $\lambda$ degrees of freedom for confidence 
intervals with the MI method, where 

$$\lambda = (M - 1)\left[1 + \frac{\bar{U}}{(1 + M^{-1})\,B}\right]^{2},$$

with $\bar{U}$ the average within-imputation variance and $B$ the 
between-imputation variance over the $M$ imputed data sets. 
Since using 1.96 with the MI method yielded intervals 
that had severe undercoverage, the $t$-distribution with $\lambda$ 
degrees of freedom is used for the MI confidence intervals. 
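
To illustrate the degrees-of-freedom calculation, the sketch below assumes the standard Rubin and Schenker (1986) form: (M - 1) times the square of one plus the ratio of the average within-imputation variance to (1 + 1/M) times the between-imputation variance. The function and argument names are hypothetical and are not taken from the paper.

```python
def mi_degrees_of_freedom(m, within_var, between_var):
    """Degrees of freedom for a multiple-imputation t interval (sketch).

    m           -- number of imputed data sets (M)
    within_var  -- average of the M within-imputation variance estimates
    between_var -- variance of the M point estimates across imputations
    """
    if m < 2:
        raise ValueError("at least two imputations are needed")
    if between_var <= 0:
        raise ValueError("between-imputation variance must be positive")

    # Ratio of within-imputation variance to the inflated between component.
    ratio = within_var / ((1 + 1 / m) * between_var)
    return (m - 1) * (1 + ratio) ** 2
```

When the between-imputation variance is small relative to the within-imputation variance, the degrees of freedom become large and the t multiplier approaches 1.96; with M = 5 and a large between component, the multiplier is noticeably larger, which is consistent with the longer MI intervals reported at low response rates.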


4. Simulation Results 


This section presents the main results from the 
simulations, beginning with the performance of the three 
methods of variance estimation for estimates from the full 
population, followed by the results for the domain estimates. 
Key outcomes are summarized here graphically, but tables 
with full details are available in Brick, Jones, Kalton, and 
Valliant (2004). 


4.1 Full Population Estimates 


Figure 1 shows the results of the simulations for 
estimating the total number of students and the number of 
districts offering pre-kindergarten from the 10,000 samples 
for each of the nine simulation schemes. The figure gives 
the relative bias of the imputed estimator, the average 
variance estimate as a percentage of the empirical variance, 
and the confidence interval coverage rate. 

The point estimates are theoretically unbiased with 
weighted hot deck imputation if all units in a hot deck cell 
have the same response probability. As noted earlier, this 
condition holds for the STR/STR and HD/HD combinations 
and also for the STR/HD combination with a uniform 
overall response probability. The graph of relative biases in 
Figure 1 is consistent with this theoretical result within the 
bounds of simulation error. While the relative biases of the 
point estimates in the other two STR/HD schemes are small 
(always less than 3%), they still may be important if the 
standard errors of the estimates are also small. Cochran 
(1977, page 12) shows that when the ratio of the bias to the 
standard error is relatively large, then the coverage rate can 
be much lower than the nominal level. For the full 
population estimates with this sample size the ratios never 
exceed 0.4, but much larger ratios occur for domain 
estimates, as discussed later. 

The graph of the ratios of the average variance estimates 
to the empirical variances (v/Var in the figures) for the three 
methods shows that these estimates have relatively small 
biases in most cases, within a range of plus or minus 8 
percent around the simulated true variance. While the ratios 
for all the methods vary across the nine schemes, the MI 
ratios are slightly more variable than the other two. 

A primary reason for computing variances is to produce 
confidence intervals. The right-hand panel in Figure 1 
shows that the coverage rates for the confidence intervals 
for the estimates are generally close to the nominal 
95 percent level, especially for the pre-kindergarten statistic. 
The coverage rates for both statistics and all the methods 
and schemes are between 91% and 96%, with the exception 
of the number of students for the STR/HD 0.2 to 0.6 
scheme. The coverage rates of 88% or less for all three 
methods in this case, with its extremely high rate of 
nonresponse, are due to the relatively large bias in the point 
estimate. Overall, all three variance estimation methods 
produce confidence intervals with coverages that are vast 
improvements over those for intervals based on naive 
variance estimates (Brick et al. 2004). 

The confidence interval coverage rates for the MA and 
AJ methods are essentially equivalent. The MI coverage 
rates are generally slightly greater than those for the MA 
and AJ methods. The MI coverage rates are slightly closer 
to the nominal level for the number of students. Most of the 
differences are small. 

For all three variance estimation methods, the upper and 
lower confidence interval coverage rates were similar. For 
the number of students, which is a highly skewed variable, 
the coverage rates in the two tails are unequal due to 
correlation between the estimated total and the standard 
error estimates. The asymmetric tail coverages are also 
associated with lower overall coverage rates. 

The MA and AJ methods yield confidence intervals that 
have nearly the same average length across the schemes and 
variables. Because the MI method uses t-distribution 
values, its intervals range from 10 to 20 percent longer than 
the MA and AJ intervals when the response rates are low. 
With the higher response rates, the MI intervals range from 
about the same to 5 percent longer than the intervals from 
the two other methods. The MI confidence intervals could, 
of course, be shortened by increasing M (Rubin 1987, 
Chapter 4), even though M = 5 is typical for applications. 


4.2 Domain Estimates 


Estimating characteristics for domains that are not 
explicitly incorporated in the imputation scheme can be 
problematic when the missing data rate is not trivial. Kalton 
and Kasprzyk (1986) and Rubin (1996) along with many 
others have discussed this point and urged the inclusion of 
as many variables as possible in the imputation process. 
However, given the many preplanned and ad hoc domain 
analyses that are carried out with survey data, it is 
unrealistic to assume that all domains can be accounted for 
in an imputation scheme. For this reason, the design of the 
simulations intentionally did not include the domains 
explicitly in the definition of the hot deck cells. In the case 
of multiple imputation, issues of variance estimation for 
domain estimates have received much attention (e.g., Fay 
1992; Meng 1994; Rubin 1996). 

In the simulations we estimate the totals for two 
domains: school districts in the NE and those in NMSA. 
Figures 2 and 3 present the results of the simulations for the 
NE domain and for the NMSA domain, respectively, in the 
same format as used before. Note that the scales for Figures 
2 and 3 differ from each other and are very different from 
those used for the full population estimates. 

For the NE domain, the point estimates have large 
positive biases for the STR/STR combinations. Hot deck 
cells based on STR are not related to region, and, as a result, 
NE districts with missing data have donors from other 
regions, which have different characteristics. In contrast, the 
inclusion of region in the construction of the HD imputation 
cells removes the bias of the point estimates in the HD/HD 
combinations and the STR/HD combination with uniform 
overall response probability, and reduces the bias in the 
other STR/HD combinations. 

All three methods of variance estimation require 
unbiased point estimates and theory for the methods does 
not provide guidance on how the methods will perform 
under the conditions we study. The variance estimates are 
approximately unbiased for all three variance estimation 
methods when the domain point estimates are unbiased or 
have only small biases. However, Figure 2 shows that for 
the STR/STR combination, where the point estimates are 
seriously biased, the variance estimates usually overestimate 
the empirical variances. 

Figure 2 shows that the coverage rates for the HD/HD 
and STR/HD schemes, for which the point estimates have 
no or small relative biases, are between 92 percent and 
96 percent for all but one of these schemes and variance 
estimation methods. The exception is the STR/HD 
combination with response rates between 0.2 and 0.6, which 
has coverage rates as low as 86 percent for the number of 
students. 

For the STR/STR schemes, Figure 2 shows that all the 
methods tend to cover at greater than the nominal level for 
the number of students and less than the nominal level for 
the number of districts with pre-kindergarten. The 
difference in the coverage rates for the two variables is due 
to the sizes of the relative bias of the point estimates and of 
the variance estimates. 

Turning to the NMSA domain estimates in Figure 3, note 
that metropolitan status is not explicitly included in the 
definitions of either STR or HD, although it is clearly 
correlated with size and, thus, with STR. The point 
estimates for the number of students in the NMSA domain 
for all the schemes have substantial positive biases. The MA 
confidence intervals consistently cover at the nominal level 
or higher, primarily due to the extreme positive biases of the 
variance estimates. The AJ intervals cover at close to the 
nominal level for the HD/HD and STR/HD schemes, but 
undercover in the three STR/STR schemes. The patterns for 
the MI coverages are similar to those of the AJ, except that 
the MI intervals appreciably undercover in the HD/HD 
scheme with 0.2 to 0.6 response rates. 

The point estimates of the number of districts with pre- 
kindergarten in the NMSA domain have moderate negative 
relative biases for all nine schemes. The confidence 
intervals for all three methods of variance estimation are 
close to the nominal level, without the overcoverage found 
in the NE domain estimates. 

Figure 2. Relative biases, variance ratios, and 95% confidence interval coverage for number of students (*) and number of districts 
with pre-kindergarten (A) in the Northeast. 

Figure 3. Relative biases, variance ratios, and 95% confidence interval coverage for number of students (*) and number of districts 
with pre-kindergarten (A) in nonmetropolitan areas. 


5. Conclusions 


The simulations examined the performance of three 
variance estimators for imputed totals from a single-stage 
stratified sample design under different response 
mechanisms with weighted hot deck imputation. The 
circumstances reflected what can be expected in practice in 
the sense that the assumptions of the methods were violated 
in different ways. All three methods were substantial 
improvements over the naive variance estimator. All three 
methods performed very well with unbiased point estimates. 
When the point estimates had large biases, none of the 
methods produced confidence intervals with the nominal 
coverage levels. Poor coverage rates for biased point 
estimates are not unexpected since the same result holds 
with no missing data. When the point estimates had 
relatively small biases, the actual coverage rates for the 
three variance estimation methods sometimes exceeded and 
sometimes fell short of the nominal levels. In this case the 
tendency of all three methods to overestimate the variance 
often resulted in coverage rates close to the nominal level. 
Low response rates were associated with undercoverage, 
largely due to the greater biases in the point estimates. 


The differences in the coverage rates of the three 
methods were generally too small and inconsistent to 
support claims that any one method is superior in general. 
With very low response rates, the average lengths of the 
confidence intervals for the MI method were appreciably 
longer than those for the MA and AJ methods, but using a 
larger number of sets of imputations with the MI method 
would rectify that problem. It should, however, be noted 
that these simulations only address single stage sampling. 
Differences in confidence interval lengths between methods 
may exist in cluster samples. This possibility awaits further 
investigation. 

The results of this study give practitioners of hot deck 
imputation empirical evidence that all of the variance 
estimation methods perform well in single stage samples 
provided that the point estimate is unbiased, even when 
other assumptions are violated. Estimates for domains that 
are not taken into account in the imputation scheme are 
susceptible to large biases. When the point estimates are 
seriously biased, the methods may produce confidence 
intervals that cover at far less than the nominal rate. 
Analysts of imputed data sets should examine whether the 
imputation method that has been used is likely to give 
approximately unbiased estimates, especially for domain 
estimates. If not, they may need to re-impute the missing 
items to give less biased point estimates. Advice to imputers 
to take advantage of as many explanatory variables as 
feasible in the imputation process is not new, but the 
evidence from the simulations demonstrates its importance. 


Acknowledgements 


The authors would like to thank the National Center for 
Education Statistics, Institute for Education Sciences for 
supporting this research, and in particular Marilyn Seastrom. 
We also would like to thank the referees for their 
constructive comments. 


References 


Brick, J.M., Kalton, G. and Kim, J.K. (2004). Variance estimation 
with hot deck imputation using a model. Survey Methodology, 30, 
57-66. 


Brick, J.M., Jones, M., Kalton, G. and Valliant, R. (2004). A 
simulation study of three methods of variance estimation with hot 
deck imputation for stratified samples. Prepared under contract 
No. RN95127001 to the National Center for Education Statistics. 
Rockville, MD: Westat, Inc. 


Cochran, W.G. (1977). Sampling Techniques. New York: John Wiley 
& Sons Inc. 


Fay, R.E. (1992). When are imputations from multiple imputation 
valid. Proceedings of the Survey Research Methods Section, 
American Statistical Association, 227-232. 


Kalton, G., and Kasprzyk, D. (1986). The treatment of missing survey 
data. Survey Methodology, 12, 1-16. 


Lee, H., Rancourt, E. and Särndal, C.-E. (1995). Jackknife variance 
estimation for data with imputed values. Proceedings of the 
Statistical Society of Canada Survey Methods Section, 111-115. 


Lee, H., Rancourt, E. and Särndal, C.-E. (2001). Variance estimation 
from survey data under single imputation. In Survey Nonresponse 
(Eds. R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little), 
Chapter 21, New York: John Wiley & Sons Inc. 


Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis with 
Missing Data. New York: John Wiley & Sons Inc. 


Meng, X.-L. (1994). Multiple imputation inferences with uncongenial 
sources of input. (With discussion). Statistical Science, 9, 538-573. 


Rao, J.N.K., and Shao, J. (1992). Jackknife variance estimation with 
survey data under hot deck imputation. Biometrika, 79, 811-822. 


Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. 
New York: John Wiley & Sons Inc. 


Rubin, D.B. (1996). Multiple imputation after 18+ years (with 
discussion). Journal of the American Statistical Association, 91, 
473-489. 


Rubin, D.B., and Schenker, N. (1986). Multiple imputation for 
interval estimation from simple random samples with 
nonignorable nonresponse. Journal of the American Statistical 
Association, 81, 361-374. 


Rust, K., and Rao, J.N.K. (1996). Variance estimation for complex 
estimators in sample surveys. Statistics in Medicine, 5, 381-397. 


Särndal, C.-E. (1992). Methods for estimating the precision of survey 
estimates when imputation has been used. Survey Methodology, 
18, 241-252. 


Shao, J., and Steel, P. (1999). Variance estimation for survey data 
with composite estimation and nonnegligible sampling fractions. 
Journal of the American Statistical Association, 94, 254-265. 


Statistics Canada, Catalogue No. 12-001-XPB 


Survey Methodology, December 2005 
Vol. 31, No. 2, pp. 161-168 
Statistics Canada, Catalogue No. 12-001-XPB 


Does Weighting for Nonresponse Increase the 
Variance of Survey Means? 


Roderick J. Little and Sonya Vartivarian ¹ 


Abstract 


Nonresponse weighting is a common method for handling unit nonresponse in surveys. The method is aimed at reducing 
nonresponse bias, and it is often accompanied by an increase in variance. Hence, the efficacy of weighting adjustments is 
often seen as a bias-variance trade-off. This view is an oversimplification — nonresponse weighting can in fact lead to a 
reduction in variance as well as bias. A covariate for a weighting adjustment must have two characteristics to reduce 
nonresponse bias — it needs to be related to the probability of response, and it needs to be related to the survey outcome. If 
the latter is true, then weighting can reduce, not increase, sampling variance. A detailed analysis of bias and variance is 
provided in the setting of weighting for an estimate of a survey mean based on adjustment cells. The analysis suggests that 
the most important feature of variables for inclusion in weighting adjustments is that they are predictive of survey outcomes; 
prediction of the propensity to respond is a secondary, though useful, goal. Empirical estimates of root mean squared error 
for assessing when weighting is effective are proposed and evaluated in a simulation study. A simple composite estimator 
based on the empirical root mean squared error yields some gains over the weighted estimator in the simulations. 


Key Words: Missing data; Nonresponse adjustment; Sampling weights; Survey nonresponse. 


1. Introduction 


In most surveys, some individuals provide no information 
because of noncontact or refusal to respond (unit 
nonresponse). The most common method of adjustment for 
unit nonresponse is weighting, where respondents and 
nonrespondents are classified into adjustment cells based on 
covariate information known for all units in the sample, and 
a nonresponse weight is computed for cases in a cell 
proportional to the inverse of the response rate in the cell. 
These weights often multiply the sample weight, and the 
overall weight is normalized to sum to the number of 
respondents in the sample. A good overview of nonresponse 
weighting is Oh and Scheuren (1983). A related approach to 
nonresponse weighting is post-stratification (Holt and Smith 
1979), which applies when the distribution of the population 
over adjustment cells is available from external sources, 
such as a Census. The weight is then proportional to the 
ratio of the population count in a cell to the number of 
respondents in that cell. 

Nonresponse weighting is primarily viewed as a device 
for reducing bias from unit nonresponse. This role of 
weighting is analogous to the role of sampling weights, and 
is related to the design unbiasedness property of the 
Horvitz-Thompson estimator of the total (Horvitz and 
Thompson 1952), which weights units by the inverse of 
their selection probabilities. Nonresponse weighting can be 
viewed as a natural extension of this idea, where included 
units are weighted by the inverse of their inclusion 


probabilities, estimated as the product of the probability of 
selection and the probability of response given selection; the 
inverse of the latter probability is the nonresponse weight. 
Modelers have argued that weighting for bias adjustment is 
not necessary for models where the weights are not 
associated with the survey outcomes, but in practice few are 
willing to make such a strong assumption. 

Sampling weights reduce bias at the expense of increased 
variance, if the outcome has a constant variance. Given the 
analogy of nonresponse weights with sampling weights, it 
seems plausible that nonresponse weighting also reduces 
bias at the expense of an increase in the variance of survey 
estimates. The idea of a bias-variance trade-off arises in 
discussions of nonresponse weighting adjustments (Kalton 
and Kasprzyk 1986, Kish 1992, Little, Lewitzky, Heeringa, 
Lepkowski and Kessler 1997). Kish (1992) presents a 
simple formula for the proportional increase in variance 
from weighting, say L, under the assumption that the 
variance of the observations is approximately constant: 


$$L = \mathrm{cv}^2, \tag{1}$$


where cv is the coefficient of variation of the respondent 
weights. 
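
As a small illustration of Equation (1), the sketch below computes the squared coefficient of variation of a set of respondent weights. The function name is hypothetical, and the use of the population form of the variance (divisor equal to the number of weights) is an assumption made so that 1 + L equals n times the sum of squared weights divided by the squared sum of weights.

```python
def kish_variance_inflation(weights):
    """Proportional variance increase L = cv^2 of the weights (sketch)."""
    n = len(weights)
    mean_w = sum(weights) / n
    # Population variance of the weights (divisor n, an assumption here).
    var_w = sum((w - mean_w) ** 2 for w in weights) / n
    return var_w / mean_w ** 2
```

For example, equal weights give L = 0, while weights of 1, 1, 3 and 3 give L = 0.25, that is, a 25 percent increase in variance under the constant-variance assumption.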

Equation (1) is a good approximation when the 
adjustment cell variable is weakly associated with the 
survey outcome. However, since it approximates variance 
rather than mean squared error, it does not measure the 
potential nonresponse bias reduction that is the main 
objective of weighting, and it does not apply to outcomes 
that are associated with the adjustment cell variable, where 
nonresponse weighting can in fact reduce the variance. The 
fact that nonresponse weighting can reduce variance is 
implicit in the formulae in Oh and Scheuren (1983), and is 
noted in Little (1986) when adjustment cells are created 
using predictive mean stratification. It is also seen in the 
related method of post-stratification for nonresponse 
adjustment (Holt and Smith 1979). 


1. Roderick J. Little, University of Michigan, U.S.A. E-mail: rlittle@umich.edu; Sonya Vartivarian, Mathematica Policy Research, Inc. 600 Maryland Ave 

Variability of the weights per se does not necessarily 
translate into estimates with high variance: an estimate with 
a high value of L can have a smaller variance than an 
estimate with a small value of L, as is shown in the 
simulations in section 3. Also, the situations where 
nonresponse weighting is most effective in reducing bias are 
precisely the situations where the weighting tends to reduce, 
not increase, variance, and Equation (1) does not apply. This 
differs from the case of sampling weights, and is related to 
“super-efficiency” that can result when weights are 
estimated from the sample rather than fixed constants; see, 
for example, Robins, Rotnitzky and Zhao (1994). 

We propose a simple refinement of Equation (1), namely 
Equation (14) below, that captures both bias and variance 
components whether or not the adjustment cell variable is 
associated with the outcome, and hence is a more accurate 
gauge of the value of weighting the estimates, and of 
alternative adjustment cell variables. In multipurpose 
surveys with many outcomes, the standard approach is to 
apply the same nonresponse weighting adjustment to all the 
variables, with the implicit assumption that the value of 
nonresponse bias reduction for some variables outweighs 
the potential variance increase for others. Our empirical 
estimate of mean squared error allows a simple refinement 
of this strategy, namely to restrict nonresponse weighting to 
the subset of variables for which nonresponse weighting 
reduces the estimated mean squared error. This composite 
strategy is assessed in the simulation study in section 3, and 
shows some gains over weighting all the outcomes. As 
noted in section 4, there are alternative approaches that have 
even better statistical properties, but these lead to different 
weights for each variable and hence are more cumbersome 
to implement and explain to survey users. 


2. Nonresponse Weighting Adjustments 
for a Mean 


Suppose a sample of n units is selected. We consider 
inference for the population mean of a survey variable Y 
subject to nonresponse. To keep things simple and focused 
on the nonresponse adjustment question, we assume that 
units are selected by simple random sampling. The points 
made here about nonresponse adjustments also apply in 
general to complex designs, although the technical details 
become more complicated. 

We assume that respondents and nonrespondents can be 
classified into $C$ adjustment cells based on a covariate $X$. Let 
$M$ be a missing-data indicator taking the value 0 for 
respondents and 1 for nonrespondents. Let $n_{mc}$ be the 
number of sampled individuals with $M = m$, $X = c$, 
$m = 0, 1$; $c = 1, \ldots, C$, $n_{+c} = n_{0c} + n_{1c}$ denote the number of 
sampled individuals in cell $c$, $n_0 = \sum_{c=1}^{C} n_{0c}$ and 
$n_1 = \sum_{c=1}^{C} n_{1c}$ the total number of respondents and 
nonrespondents, and $p_c = n_{+c}/n$, $p_{0c} = n_{0c}/n_0$ the proportions 
of sampled and responding cases in cell $c$. We compare two 
estimates of the population mean $\mu$ of $Y$, the unweighted 
mean 

$$\bar{y}_0 = \sum_{c=1}^{C} p_{0c}\,\bar{y}_{0c}, \tag{2}$$

where $\bar{y}_{0c}$ is the respondent mean in cell $c$, and the 
weighted mean 

$$\bar{y}_w = \sum_{c=1}^{C} p_c\,\bar{y}_{0c} = \sum_{c=1}^{C} w_c\,p_{0c}\,\bar{y}_{0c}, \tag{3}$$

which weights respondents in cell $c$ by the inverse of the 
response rate $w_c = p_c/p_{0c}$. The estimator (3) can be 
viewed as a special case of a regression estimator, where 
missing values are imputed by the regression of Y on 
indicators for the adjustment cells. We compare the bias and 
mean squared error of (2) and (3) under the following 
model, which captures the important features of the 
problem. We suppose that conditional on the sample size n, 
the sampled cases have a multinomial distribution over the 
(C x2) contingency table based on the classification of M 
and X, with cell probabilities 


$$\Pr(M = 0, X = c) = \phi\,\pi_{0c}; \quad \Pr(M = 1, X = c) = (1 - \phi)\,\pi_{1c},$$

where $\phi = \Pr(M = 0)$ is the marginal probability of 
response. The conditional distribution of $X$ given $M = 0$ 
and $n_0$ is multinomial with cell probabilities 
$\Pr(X = c \mid M = 0) = \pi_{0c}$, and the marginal distribution of $X$ given 
$n$ is multinomial with index $n$ and cell probabilities 

$$\Pr(X = c) = \phi\,\pi_{0c} + (1 - \phi)\,\pi_{1c} = \pi_c,$$

say. We assume that the conditional distribution of $Y$ given 
$M = m$, $X = c$ has mean $\mu_{mc}$ and constant variance $\sigma^2$. 
The means of $Y$ for respondents and nonrespondents are 

$$\mu_0 = \sum_{c=1}^{C} \pi_{0c}\,\mu_{0c}, \quad \mu_1 = \sum_{c=1}^{C} \pi_{1c}\,\mu_{1c},$$

respectively, and the overall mean of $Y$ is 
$\mu = \phi\,\mu_0 + (1 - \phi)\,\mu_1$. 
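
The unweighted mean (2) and the weighted mean (3) can be computed directly from cell-level counts and respondent means. The sketch below is illustrative only: the function name and argument names are hypothetical, and every cell is assumed to contain at least one respondent so that the weights are defined.

```python
def cell_weighted_means(n_resp, n_nonresp, resp_means):
    """Unweighted and nonresponse-weighted means from adjustment cells.

    n_resp[c]     -- number of respondents in cell c
    n_nonresp[c]  -- number of nonrespondents in cell c
    resp_means[c] -- respondent mean of Y in cell c
    """
    n0 = sum(n_resp)
    n = n0 + sum(n_nonresp)

    # Proportion of respondents, and of all sampled cases, in each cell.
    p0 = [r / n0 for r in n_resp]
    p = [(r + m) / n for r, m in zip(n_resp, n_nonresp)]

    # Nonresponse weights, proportional to the inverse cell response rate.
    weights = [pc / p0c for pc, p0c in zip(p, p0)]

    unweighted = sum(p0c * y for p0c, y in zip(p0, resp_means))
    weighted = sum(w * p0c * y for w, p0c, y in zip(weights, p0, resp_means))
    return unweighted, weighted, weights
```

With two cells holding 40 and 10 respondents, 10 and 40 nonrespondents, and respondent means 10 and 20, the unweighted mean is 12 while the weighted mean is 15, because the second cell is underrepresented among respondents and its weight (2.5) restores its sample share.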

Under this model, the conditional mean and variance of 
$\bar{y}_w$ given $\{p_c\}$ are respectively $\sum_{c=1}^{C} p_c\,\mu_{0c}$ and 
$\sigma^2 \sum_{c=1}^{C} p_c^2/n_{0c}$. Hence the bias of $\bar{y}_w$ is 

$$b(\bar{y}_w) = \sum_{c=1}^{C} \pi_c\,(\mu_{0c} - \mu_c),$$

where $\pi_c$ and $\mu_c$ are the population proportion and mean 
of $Y$ in cell $c$. This can be written as 

$$b(\bar{y}_w) = \tilde{\mu}_0 - \mu, \tag{4}$$

where $\tilde{\mu}_0 = \sum_{c=1}^{C} \pi_c\,\mu_{0c}$ is the respondent mean "adjusted" 
for the covariates, and $\mu = \sum_{c=1}^{C} \pi_c\,\mu_c$ is the true population 
mean of $Y$. The variance of $\bar{y}_w$ is the sum of the expected 
value of the conditional variance and the variance of its 
conditional expectation, and is approximately 

$$V(\bar{y}_w) = (1 + \lambda)\,\sigma^2/n_0 + \sum_{c=1}^{C} \pi_c\,(\mu_{0c} - \tilde{\mu}_0)^2/n, \tag{5}$$

where $\lambda = \sum_{c=1}^{C} \pi_{0c}\,(\pi_c/\pi_{0c} - 1)^2$ is the population analog 
of the variance of the nonresponse weights $\{w_c\}$, which is 
the same as $L$ in Equation (1) since the weights are scaled to 
average to one. The formula for the variance of the weighted 
mean in Oh and Scheuren (1983), derived under the quasi-randomization 
perspective, reduces to (5) when the within-cell 
variance is assumed constant, and finite population 
corrections and terms of order $1/n^2$ are ignored. The mean 
squared error of $\bar{y}_w$ is thus 

$$\mathrm{mse}(\bar{y}_w) = b^2(\bar{y}_w) + V(\bar{y}_w). \tag{6}$$
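The mean squared error of the unweighted mean (2) is 

$$\mathrm{mse}(\bar{y}_0) = b^2(\bar{y}_0) + V(\bar{y}_0), \tag{7}$$

where 

$$b(\bar{y}_0) = b(\bar{y}_w) + \mu_0 - \tilde{\mu}_0 \tag{8}$$

is the bias and 

$$V(\bar{y}_0) = \sigma^2/n_0 + \sum_{c=1}^{C} \pi_{0c}\,(\mu_{0c} - \mu_0)^2/n_0 \tag{9}$$

is the variance. Hence the difference (say $\Delta$) in mean 
squared errors is 

$$\begin{aligned}
\Delta &= \mathrm{mse}(\bar{y}_0) - \mathrm{mse}(\bar{y}_w) = B + V_1 - V_2, \text{ where} \\
B &= (\mu_0 - \tilde{\mu}_0)^2 + 2(\mu_0 - \tilde{\mu}_0)(\tilde{\mu}_0 - \mu), \\
V_1 &= \sum_{c=1}^{C} \pi_{0c}\,(\mu_{0c} - \mu_0)^2/n_0 - \sum_{c=1}^{C} \pi_c\,(\mu_{0c} - \tilde{\mu}_0)^2/n, \\
V_2 &= \lambda\,\sigma^2/n_0.
\end{aligned} \tag{10}$$

Equation (10) and its detailed interpretation provide the 
main results of the paper; note that positive terms in (10) 
favor the weighted estimator $\bar{y}_w$. 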

(a) The first term B represents the impact on MSE of bias
reduction from adjustment on the covariates. It is order
one and increasingly dominates the MSE as the sample
size increases. If $\mu \le \tilde{\mu}_0 < \mu_0$ or
$\mu_0 < \tilde{\mu}_0 \le \mu$, then
weighting has reduced the bias of the respondent
mean, and both of the components of B are positive. In
particular, if the missing data are missing at random
(Rubin 1976, Little and Rubin 2002), in the sense that
respondents are a random sample of the sampled cases
in each cell c, then $\tilde{\mu}_0 = \mu$ and weighting eliminates
the bias of the unweighted mean. The bias adjustment
is

$$\mu_0 - \tilde{\mu}_0 = \sum_{c=1}^{C} \pi_{0c}(1 - w_c)(\mu_{0c} - \mu_0),$$

ignoring differences between the weights and their
expectations. This is zero to O(1) if either nonresponse
is unrelated to the adjustment cells (in which
case $w_c \approx 1$ for all c), or the outcome is unrelated to
the adjustment cells (in which case $\mu_{0c} \approx \mu_0$ for all c).
Thus a substantial bias reduction requires adjustment
cell variables that are related both to nonresponse and
to the outcome of interest, a fact that has been noted by
several authors. It is often believed that conditioning
on observed characteristics of nonrespondents will
reduce bias, but note that this is not guaranteed; it is
possible for the adjusted mean to be further on average
from the true mean than the unadjusted mean, in which
case weighting makes the bias worse.


(b) The effect of weighting on the variance is represented
by $V_1 - V_2$.


(c) For outcomes Y that are unrelated to the adjustment
cells, $\mu_{0c} = \mu_0$ for all c, $V_1 = 0$, and weighting
increases the variance, since $V_2$ is positive. The
variance part of equation (10) then reduces to the
population version of Kish’s formula (1). Adjustment
cell variables that are good predictors of nonresponse
hurt rather than help in this situation, since they
increase the variance of the weights without any
reduction in bias; but there is no bias-variance trade-off
for these outcomes, since there is no bias reduction.


(d) If the adjustment cell variable X is unrelated to
nonresponse, then $\lambda$ is O(1/n) and hence $V_2$ has a lower
order of variability than $V_1$. The term $V_1$ tends to be
positive, since $\sum_{c=1}^{C} \pi_{0c}(\mu_{0c} - \mu_0)^2 \approx \sum_{c=1}^{C} \pi_c(\mu_{0c} - \tilde{\mu}_0)^2$,
and the divisor n in the second term is larger
than the divisor $n_0$ in the first term. Thus weighting in
this case tends to have no impact on the bias, but
reduces variance to the extent that X is a good predictor
of the outcome. This contradicts the notion that
weighting increases variance. The above-mentioned
“super-efficiency” that results from estimating
nonresponse weights from the sample is seen by the fact
that if the data are missing completely at random, then
the “true” nonresponse weight is a constant for all
responding units. Hence weighting by “true” weights
leads to (2), which is less efficient than weighting by
the “estimated” weights, which leads to (3).


(e) If the adjustment cell variable is a good predictor of the
outcome and also predictive of nonresponse, then $V_2$
is again small because of the reduced residual variance
$\sigma^2$, and $V_1$ is generally positive by a similar argument
to (d). The term $\sum_{c=1}^{C} \pi_{0c}(\mu_{0c} - \mu_0)^2$ may deviate
more from $\sum_{c=1}^{C} \pi_c(\mu_{0c} - \tilde{\mu}_0)^2$ because the weights
are less alike, but this difference could be positive or
negative, and the different divisors seem more likely to
determine the sign and size of $V_1$. Thus, weighting
tends to reduce both bias and variance in this case.


(f) Equation (9) can be applied to the case of
post-stratification on population counts, by letting n
represent the population size rather than the sample
size. Assuming a large population, the second term in
$V_1$ essentially vanishes, increasing the potential for
variance reduction when the variables forming the
post-strata are predictive of the outcome. This finding
replicates previous results on post-stratification (Holt
and Smith 1979; Little 1993).
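
The bias adjustment discussed in (a) can be checked numerically. This is a sketch under stated assumptions: the function name and inputs are hypothetical, and the weight $w_c$ is taken to be $\pi_c/\pi_{0c}$ (inverse response rate, scaled to average one over respondents), which is an assumption about the weight definition rather than something stated in this passage.

```python
import numpy as np

def bias_adjustment(pi, resp_rate, mu0c):
    """Compare mu_0 - mu~_0 with sum_c pi_0c (1 - w_c)(mu_0c - mu_0).

    pi        : cell proportions in the full sample
    resp_rate : response rate in each cell (hypothetical input)
    mu0c      : respondent mean in each cell
    """
    pi = np.asarray(pi, dtype=float)
    resp_rate = np.asarray(resp_rate, dtype=float)
    mu0c = np.asarray(mu0c, dtype=float)

    pi0 = pi * resp_rate / np.sum(pi * resp_rate)  # respondent cell proportions
    w = pi / pi0                                   # assumed weights, mean one over respondents

    mu0 = np.sum(pi0 * mu0c)
    mu0_tilde = np.sum(pi * mu0c)
    direct = mu0 - mu0_tilde
    via_weights = np.sum(pi0 * (1 - w) * (mu0c - mu0))
    return direct, via_weights

direct, via_weights = bias_adjustment([0.2, 0.3, 0.5], [0.9, 0.6, 0.4], [10.0, 12.0, 15.0])
flat, _ = bias_adjustment([0.2, 0.3, 0.5], [0.6, 0.6, 0.6], [10.0, 12.0, 15.0])
```

With a constant response rate (`flat`) the adjustment vanishes, which is the situation in which the cells are unrelated to nonresponse.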


A simple qualitative summary of the results (a) — (f) of 
section 2 is shown in Table 1, which indicates the direction 
of bias and variance when the associations between the 
adjustment cells and the outcome and missing indicator are 
high or low. Clearly, weighting is only effective for 
outcomes that are associated with the adjustment cell 
variable, since otherwise it increases the variance with no 
compensating reduction in bias. For outcomes that are 
associated with the adjustment cell variable, weighting 
increases precision, and also reduces bias if the adjustment 
cell variable is related to nonresponse. 


Table 1
Effect of Weighting Adjustments on Bias and Variance of
a Mean, by Strength of Association of the Adjustment Cell
Variables with Nonresponse and Outcome

Association with      Association with outcome
nonresponse           Low              High

Low                   Cell 1           Cell 3
                      Bias: ---        Bias: ---
                      Var:  ---        Var:  ↓

High                  Cell 2           Cell 4
                      Bias: ---        Bias: ↓
                      Var:  ↑          Var:  ↓


It is useful to have estimates of the MSE of $\bar{y}_w$ and $\bar{y}_0$
that can be computed from the observed data. Let
$s_{0c}^2 = \sum_{i \in c}(y_i - \bar{y}_{0c})^2/(n_{0c} - 1)$ denote the sample variance of
respondents in cell c, $s^2 = \sum_{c=1}^{C}(n_{0c} - 1)s_{0c}^2/(n_0 - C)$ the
pooled within-cell variance, and $s_0^2 = \sum_{i=1}^{n_0}(y_i - \bar{y}_0)^2/(n_0 - 1)$
the total sample variance of the respondent
values. We use the following approximately unbiased
expressions, under the assumption that the data are MAR:

$$\widehat{\mathrm{mse}}(\bar{y}_0) = \hat{B}^2(\bar{y}_0) + \hat{V}(\bar{y}_0), \tag{11}$$

where $\hat{V}(\bar{y}_0) = s_0^2/n_0$ and

$$\hat{B}^2(\bar{y}_0) = \max\{0, (\bar{y}_w - \bar{y}_0)^2 - \hat{V}_d\},$$


(G 
3 Pic Voc — yy’) In, 
c=1 


ie 
Vi, =M /n)” +) Poc Voc — Yo) /ny |, (12) 


c= 


G 
+57 > (Pic ia Poc) (No. 


c=l 


where y\” =X%, p,.¥o.. and V, estimates the variance of 
(Y,,— Yo) and is included in (12) as a bias adjustment for 
(¥,,— Yo)’ as an estimate of B*(¥,), similar to that in 
Little et al. (1997). Also 


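$$\widehat{\mathrm{mse}}(\bar{y}_w) = \hat{V}(\bar{y}_w) = (1 + L)s^2/n_0 + \sum_{c=1}^{C} p_c(\bar{y}_{0c} - \bar{y}_w)^2/n. \tag{13}$$

Subtracting (11) from (13), the difference in MSE’s of $\bar{y}_w$
and $\bar{y}_0$ is then estimated by

$$D = Ls^2/n_0 - (s_0^2 - s^2)/n_0 + \sum_{c=1}^{C} p_c(\bar{y}_{0c} - \bar{y}_w)^2/n - \hat{B}^2(\bar{y}_0). \tag{14}$$

This is our proposed refinement of (1), which is represented
by the leading term on the right side of (14).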
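
The sample quantities above can be computed directly from respondent data. The sketch below is illustrative only: the function name, argument layout and the tiny data set are hypothetical, and $1 + L$ is computed as $n_0 \sum w_i^2 / (\sum w_i)^2$ over respondents with $w_i = n_c/n_{0c}$, which is an assumption about how L is defined. It stops at the variance estimate $s_0^2/n_0$ and the weighted-mean estimate (13), because the bias term in (11) additionally requires the variance estimate $\hat{V}_d$.

```python
import numpy as np

def sample_based_summaries(y, cell, n_c):
    """y: respondent outcomes; cell: cell label (0..C-1) of each respondent;
    n_c: number of sampled cases (respondents and nonrespondents) per cell."""
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell, dtype=int)
    n_c = np.asarray(n_c, dtype=float)
    C, n = len(n_c), n_c.sum()

    n0c = np.bincount(cell, minlength=C).astype(float)      # respondents per cell
    n0 = n0c.sum()
    ybar0c = np.bincount(cell, weights=y, minlength=C) / n0c
    p_c = n_c / n

    ybar_w = np.sum(p_c * ybar0c)                            # weighted mean
    ybar_0 = y.mean()                                        # unweighted mean
    s2 = np.sum((y - ybar0c[cell]) ** 2) / (n0 - C)          # pooled within-cell variance
    s0_2 = np.sum((y - ybar_0) ** 2) / (n0 - 1)              # total respondent variance

    w = (n_c / n0c)[cell]                                    # assumed respondent weights
    L = n0 * np.sum(w ** 2) / np.sum(w) ** 2 - 1.0           # assumed definition of L

    mse_w = (1 + L) * s2 / n0 + np.sum(p_c * (ybar0c - ybar_w) ** 2) / n  # (13)
    var_0 = s0_2 / n0                                        # variance part of (11)
    return {"ybar_w": ybar_w, "ybar_0": ybar_0, "s2": s2, "s0_2": s0_2,
            "L": L, "mse_w": mse_w, "var_0": var_0}

res = sample_based_summaries(y=[1, 3, 2, 6], cell=[0, 0, 1, 1], n_c=[4, 4])
```

Every cell must contain at least one respondent, otherwise the cell mean is undefined.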


3. Simulation Study 


We include simulations to illustrate the bias and variance 
of the weighted and unweighted mean for sets of parameters 
representing each cell in Table 1. We also compare the 
analytic MSE approximations in Equations (6) and (7) and 
their sample-based estimates (11) and (13) with the 
empirical MSE over repeated samples. 


3.1 Superpopulation Parameters 


The simulation set-up for the joint distribution of X and 
M is described in Table 2. The sample is approximately 
uniformly distributed across the adjustment cell variable X, 
which has C =10 cells. Two marginal response rates are 
chosen, 70%, corresponding to a typical survey value, and 
52%, a more extreme value to accentuate differences in 
methods. Three distributions of M given X are simulated to 
model high, medium and low association. 

The simulated distributions of the outcome Y given 
M =m, X =c are shown in Table 3. These all have the 
form 


$$[Y \mid M = m, X = c] \sim N(\beta_0 + \beta_1 X, \sigma^2).$$
Table 2
Percent of Sample Cases in Adjustment Cell X and Missingness Cell M


a. Overall Response Rate = 52% 


7.04 5 8.07 8.59 O14 9.64 9.96 


Association X 1 72 3 
Between 
M and X 
ile High M=0 0.55 1.00 4.01 
M=1 8.69 9.00 6.01 
Dy, Medium M=0 2a 3.50 4.01 
M=1 6.47 6.50 6.01 
By Low M=0 4.62 als 5 2 
M=1 4.62 4.85 4.81 
b. Overall Response Rate = 70% 
Association xX 1 2 3 
Between 
Mand xX 
iT. High M=0 0.55 3.00 6.51 
M=1 8.69 7.00 eyo! 
We Medium M=0 4.44 5.30 5.81 
M=1 4.80 4.70 4.21 
3 Low M=0 6.19 6.85 6.91 
M=1 3.05 Ba 3 iui 


6.98 7.05 7.11 WAT 7.24 V3t 737 


j SF 3.02 2.98 2.93 2.88 2.84 219 


Table 3
Parameters for $[Y \mid M = m, X = c] \sim N(\beta_0 + \beta_1 c, \sigma^2)$

Association between Y and X    $\beta_1$    $\sigma^2$    $\rho^2$
1. High                        4.75         46            0.80
2. Medium                      3.70         122           0.48
3. Low                         0.00         234           0.00


Three sets of values of $(\beta_1, \sigma^2)$ are simulated to model
high, medium and low associations between Y and X. The
intercept $\beta_0$ is chosen so that the overall mean of Y is
$\mu = 26.3625$ for each scenario.
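
The strength of association implied by each slope and residual variance can be checked with a short calculation. This is a sketch assuming X is exactly uniform on the 10 cells (the text says only approximately uniform), with the share of variance explained computed as $\beta_1^2 \mathrm{Var}(X) / (\beta_1^2 \mathrm{Var}(X) + \sigma^2)$; the names `explained_share` and `simulate_outcome` are hypothetical.

```python
import numpy as np

cells = np.arange(1, 11)      # C = 10 adjustment cells
var_x = cells.var()           # variance of X under an exactly uniform distribution

# (slope, residual variance) for the high, medium and low association scenarios
scenarios = {"High": (4.75, 46.0), "Medium": (3.70, 122.0), "Low": (0.00, 234.0)}

def explained_share(beta1, sigma2):
    """Share of Var(Y) explained by X under the linear normal model."""
    signal = beta1 ** 2 * var_x
    return signal / (signal + sigma2)

shares = {name: explained_share(b1, s2) for name, (b1, s2) in scenarios.items()}

def simulate_outcome(x, beta0, beta1, sigma2, rng):
    """Draw Y given cell values x; beta0 must be supplied by the caller."""
    x = np.asarray(x, dtype=float)
    return beta0 + beta1 * x + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
```

Under the uniform assumption the three scenarios also have nearly the same total variance, so they differ mainly in how much of it the cells explain.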

A thousand replicate samples of size n = 400 and n =
2,000 were simulated for each combination of parameters in
Tables 2 and 3. Samples where $n_{0c} = 0$ for any c were
excluded, since the weighted estimate cannot be computed;
in practice some cells would probably be pooled in such
cases. The numbers of excluded simulations are shown in
Table 4.


Table 4 
Numbers of Replicates Excluded Because of Cell 
with no Respondents 


Response Rate 


Association of    Association of    52%    70%
M and X           Y and X
High High 134 is 
Medium 120 LEGS 
Low 131 104 
Medium Low 1 0 


3.2 Comparisons of Bias, Variance and Root Mean 
Squared Error, and their Estimates 


Summaries of empirical bias and root MSE’s (RMSE’s)
are reported in Table 5. The empirical RMSE’s of the
weighted mean can be compared with the following
estimates, which are displayed in Table 5, averaged over the
1,000 replicates: the estimated RMSE based on Kish’s rule
of thumb, Equation (1), namely

$$\widehat{\mathrm{mse}}_K(\bar{y}_w) = (1 + L)s_0^2/n_0, \quad \text{where } s_0^2 = \sum_{i=1}^{n_0}(y_i - \bar{y}_0)^2/(n_0 - 1); \tag{15}$$

the analytical RMSE from Equations (6) and (7); and the
estimated RMSE from Equations (11) and (13).

Following the suggestion of Oh and Scheuren (1983), we
include in the last two columns of Table 5 the average
empirical bias and RMSE of a composite mean that chooses
between $\bar{y}_w$ and $\bar{y}_0$, picking the estimate with a lower
sample-based estimate of the MSE. The empirical bias
relative to the population parameter is reported for all
estimators. We also include the bias and RMSE of the mean
before deletion of cases due to nonresponse.
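
The selection rule behind the composite mean is simple enough to state as code. This is a minimal sketch with hypothetical names; how ties are broken is an assumption, since the text does not say.

```python
def composite_mean(ybar_w, ybar_0, mse_w_hat, mse_0_hat):
    """Return the mean whose sample-based MSE estimate is lower.

    Ties go to the weighted mean (an arbitrary choice made for this sketch).
    """
    return ybar_w if mse_w_hat <= mse_0_hat else ybar_0
```

Because the choice is made separately in each sample, the composite inherits the sampling noise of both MSE estimates.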

Table 5a shows results for simulations with a response 
rate of 52%. Rows are labeled according to the four cells in 
Table 1, with medium and high associations combined. For 
each row, the lower of the RMSE’s for the unweighted and 
weighted respondent means is bolded, indicating superiority 
for the corresponding method. 

The first four rows of Table 5a correspond to cell 4 in
Table 1, with medium/high association between Y and X and
medium/high association between M and X. In these cases
$\bar{y}_w$ has much lower RMSE than $\bar{y}_0$, reflecting substantial
bias of $\bar{y}_0$ that is removed by the weighting.

The next two rows of Table 5a correspond to cell 3 of
Table 1, with medium/high association between Y and X and
low association between M and X. In these cases $\bar{y}_0$ is no
longer seriously biased, but $\bar{y}_w$ has improved precision,
particularly when the association of Y and X is high. These
are cases where the variance is reduced, not increased, by
weighting. The analytic estimates of RMSE and sample-based
estimates are close to the empirical RMSE estimates,
while Kish’s rule of thumb overestimates the RMSE, as
predicted by the theory in section 2.

The next two rows of Table 5a correspond to cell 2 of
Table 1, where the association between Y and X is low and
the association between M and X is medium or high. In
these cases, $\bar{y}_w$ has higher MSE than $\bar{y}_0$. These cases
illustrate situations where the weighting increases variance,
with no compensating reduction in bias. The last row
corresponds to cell 1 of Table 1, with low associations
between M and X and between Y and X. The unweighted
mean has lower RMSE in these cases, but the increase in
RMSE from weighting is negligible. For the last three rows
of Table 5a, RMSE’s from Kish’s rule of thumb are similar
to those from the analytical formula in section 2 and
empirical estimates based on this formula, and all these
estimates are close to the empirical RMSE.

The last two columns of Table 5a show empirical bias
and RMSE of the composite method that chooses $\bar{y}_w$ or $\bar{y}_0$
based on the estimated RMSE. For the simulations in the
first 6 rows, the composite estimator is the same as $\bar{y}_w$, and
hence detects and removes the bias of the unweighted mean.
For simulations in cell 1 (the last row) the composite
estimator performs like $\bar{y}_w$ or $\bar{y}_0$, as expected since $\bar{y}_w$
and $\bar{y}_0$ perform similarly in this case. For simulations in
cell 2 that are not favorable to weighting, the composite
estimator has lower RMSE than $\bar{y}_w$ but considerably
higher than that of $\bar{y}_0$, suggesting that for the conditions of
this simulation the empirical MSE affords limited ability to
pick the better estimator in individual samples.

Nevertheless, the composite estimator is the best overall 
estimator of the three considered in this simulation. 

Table 5b shows results for the 70% response rate. The 
pattern of results is very similar to that of Table 5a. As 
expected, differences between the methods are smaller, 
although they remain substantial in many rows of the table. 


Table 5a
Summaries of Estimators Based on 1,000 Replicate Samples for C = 10 Adjustment Cells, Restricted to Sample
Replicates with $n_{0c} > 0$ for all c. Response Rate of 52%. Values are Multiplied by 1,000


Association with Adjustment Unweighted Weighted Before Deletion Composite 
Cells Based on X Mean Mean Mean Mean 

Cell (M, X) (Y, X) n emp. emp. analytical est. emp. emp. Kish analytical est. emp. emp. emp. emp. 
bias rmse mse’ rmse bias mse rmse’ rmmse’ rmse bias rmse bias —_ rmse 
tf High High 400 6,955 7,024 7,055 6,974 0 1,057 1,410 956-4988 -38 795 Os 057, 
2,000 7,008 7,020 7,006 7,015 -2 424 608 427 434 (2 34), —2 424 
4 High Medium 400 5,376 5,471 5,536 5,404 — 33) 12641 S10 S12168 1,297, -21 776 -33 1,264 
2,000 5,424 5,441 5,466 5,466 -41 561 650 45a 559 —30 338 —4] 561 
4 Medium High 400 3,664 3,794 3,809 3,754 -4 816 1,071 835 = 842 6 741 + 816 
20008 357038 35/3 iS 00M 35712 T3069 473 373-374 A OST 7 369 
+ Medium Medium 400 2,838 3,006 3,042 2,991 -18 938 1,095 954. 970 -9 747 -18 938 
2,000 2,864 2,900 2,898 2,893 -2 426 483 426 428 GMWES35 —2 426 
3 Low _ High 400 476 1,148 1,113 1,178 40 823 1,050 823 = 828 30 = 764 40 823 
2,000 376 587 614 595 -11 361 465 368 = 368 =O 35 -11 361 
3 Low Medium 400 350 1,106 1,095 1,134 1827 068 O25 39 -16 762 13 927 
Z0008) 287 GRSG65 DOSEN O59) -20 429 470 413 414 —22 353 -20 429 
2 High Low(0) 400 56 1,070 1,056 1,275 9697 1,658 TC GISM 1518 F a63l 2393 83 1,410 
2,000 -11 464 473-567 -26 698 698 GIO ua. 699 = 19.55 1337 —25 620 
2 Medium Low(0) 400 9 1,042 1,053 1,077 =27 1122 1112. 1,097 e125 Dihey ll -12 1,074 
2,000 -4 474 471 480 -11 491 491 491 493 11 340 -9 481 
] Low Low) 400 -30 1,038 1,050 1,055 -30 1,053 1,064 1,050 1,076 =30 » 752 -30 1,040 
2,000 —2. 472 469 469 -l1 474 +470 469471 -8 343 -1 472 

1 Computed using Equation (7)
2 Computed using Equation (11)
3 Computed using Equation (15)
4 Computed using Equation (6)
5 Computed using Equation (13)
Table 5b
Summaries of Estimators Based on 1,000 Replicate Samples for C = 10 Adjustment Cells, Restricted to Sample
Replicates with $n_{0c} > 0$ for all c. Response Rate of 70%. Values are Multiplied by 1,000


Association with Adjustment Unweighted 
Cells based on X Mean 

Cell (M, X)_ (Y, X) n emp. emp. analytical est. 
bias rmse  rmse° mse’ 

4 High High 400 4,692 4,810 4,893 4,860 
2,000 4,827 4,841 4,839 4,854 

4 High Medium 400 3,581 3,716 3,855 3,733 
2.0005 3,763) 3.5784 B37 8-9S.7407 

4 Medium High 400 2,666 2,812 2,878 2,837 
SACO OE C2I EPC LS PAUSE PL IKOS| 

4 Medium Medium 400 2,104 2,282 2,315 2,291 
2,000 2,146 2,180 2,170 2,165 

5 Low High 400 217 906 954 980 
ROOT? Biz sis 506 = =502 

3 Low Medium 400 251 922 942 960 
2,000 DA ASS AT, rag 

2 High Low(0) 400 Om 952 Ot Sa els 
2,000 -11 416 409 485 

2 Medium Low(0) 400 22 #8911 910 920 
2,000 23 418 407 411 

1 Low Low(0) 400 if 914 914... .912 
2,000 4 402 408 408 


Weighted Before Deletion Composite 
Mean Mean Mean 

emp. Kish analytical est. emp. emp. emp. emp. 
rmse rmse® mse’ rmse” bias rmse bias —_s rmse 
1,129 1,192 889 894 -129 998 —133 1,129 
400 39529 398 405 -S 334 —20 400 
1,266 1,250 1,075 1,097 -128 917 -127 1,284 
501 = 554 481 490 ll = 343 -9 501 
803. 910 794 796 -49 772 —58 803 
353 406 BB) § 2Be) -9 333 —6 353 
833-924 854 =. 861 -43 751 —28 833 
370—Ss 411 Sie, She) 10 9334 13 370 
Seo el 790 = 793 =H TT -81 797 
365 = 405 B53) Bays) 4 349 2 365 
804 916 845 852 20 a2, 15 804 
370 = 408 Shey: STS) la \ Spa -14 370 
1,445 1,349 1,298 1,358 Le sO7, 20m 292 
608 598 580 599 -4 347 31 535 
942 936 930 946 2 TS 21 925 
425 416 416 417 15 344 19 420 
NP ike 914 = 926 -5 751 1 914 
403.409 408 410 Oy meso 4 402 


6 Computed using Equation (7)
7 Computed using Equation (11)
8 Computed using Equation (15)
9 Computed using Equation (6)
10 Computed using Equation (13)


4. Discussion 


The results in sections 2 and 3 have important 
implications for the use of weighting as an adjustment tool 
for unit nonresponse. Surveys often have many outcome 
variables, and the same weights are usually applied to all 
these outcomes. The analysis of section 2 and simulations in
section 3 suggest that improved results might be obtained
by estimating the MSE of the weighted and unweighted
means and confining weighting to cases where this
relationship is substantial. A more sophisticated approach is
to apply random-effects models to shrink the weights, with 
more shrinkage for outcomes that are not strongly related to 
the covariates (e.g., Elliott and Little 2000). A flexible 
alternative to this approach is imputation based on 
prediction models, since these models allow for interval- 
scaled as well as categorical predictors, and allow 
interactions to be dropped to incorporate more main effects. 
Multiple imputation (Rubin 1987) can be used to propagate 
uncertainty. 

When there is substantial covariate information, one 
attractive approach to generalizing weighting class adjust- 
ments is to create a propensity score for each respondent 
based on a logistic regression of the nonresponse indicator 
on the covariates, and then create adjustment cells based on 
this score. Propensity score methods were originally 


developed in the context of matching cases and controls in 
observational studies (Rosenbaum and Rubin 1983), but are 
now quite commonly applied in the setting of unit 
nonresponse (Little 1986; Czajka, Hirabayashi, Little and 
Rubin 1987; Ezzati and Khare 1992). The analysis here 
suggests that for this approach to be productive, the 
propensity score has to be predictive of the outcomes. 
Vartivarian and Little (2002) consider adjustment cells 
based on joint classification by the response propensity and 
summary predictors of the outcomes, to exploit residual 
associations between the covariates and the outcome after 
adjusting for the propensity score. The requirement that 
adjustment cell variables predict the outcomes lends support 
to this approach. 

The analysis presented here might be extended in a 
number of ways. Second order terms in the variance are 
ignored here, which if included would penalize weighting 
adjustments based on a large number of small adjustment 
cells. Finite population corrections could be included, 
although it seems unlikely that they would affect the main 
conclusions. It would be of interest to see to what extent the 
results can be generalized to complex sample designs 
involving clustering and stratification. Also, careful analysis 
of the bias and variance implications of nonresponse 
weighting on statistics other than means, such as subclass 
means or regression coefficients, would be worthwhile. We 
expect it to be important that adjustment cell variables
predict the outcome in many of these analyses too, but other 
points of interest may emerge. 
Variance-Covariance Functions for Domain Means of
Ordinal Survey Items


Alistair James O’Malley and Alan Mark Zaslavsky ¹


Abstract 


Estimates of a sampling variance-covariance matrix are required in many statistical analyses, particularly for multilevel 
analysis. In univariate problems, functions relating the variance to the mean have been used to obtain variance estimates, 
pooling information across units or variables. We present variance and correlation functions for multivariate means of 
ordinal survey items, both for complete data and for data with structured non-response. Methods are also developed for 
assessing model fit, and for computing composite estimators that combine direct and model-based predictions. Survey data 
from the Consumer Assessments of Health Plans Study (CAHPS®) illustrate the application of the methodology.


Key Words: Variance function; Correlation function; Hierarchical model; Ordinal response; Nonresponse; Skip pattern.


1. Introduction 


Survey data are often used to obtain measures for 
comparisons across estimation domains. In our motivating 
example, surveys are conducted to elicit reports on 
experiences with health plans (entities administering health 
care) from enrolled members; similarly a survey might 
assess schools by administering tests to a sample of 
students. 

An essential part of the analysis of survey data is the 
calculation of sampling variances, or the sampling- 
covariance matrix of a multivariate estimator. The standard 
survey sampling approach is to compute variances directly 
for each estimator in each domain. Direct variance estimates 
may be unstable when the number of respondents to an item 
is small because the sample size for a domain is small, 
because the item is applicable to only a fraction of 
respondents (such as users of specialized equipment in 
health surveys), or because we are interested in means for a 
small subgroup (such as those with chronic illnesses). 

By modeling variance estimates as functions of the unit 
(domain) means, we can pool information across units to 
obtain more stable estimates. Although modeling may 
introduce bias, for small units this is offset by the reduction 
in sampling variation. One may also consider generalizing 
variance estimates across items in addition to or instead of 
domains. This will be appropriate when there are groups of 
items for which the same mean-variance relationship is 
likely to hold. However, when there are many more 
domains than items, the greatest potential gain is from 
generalizing across domains rather than across items. 

A Generalized Variance Function (GVF) is a 
mathematical model describing the relationship between the 


variance or relative variance of a survey estimator and its 
expectation. When multiple estimates are produced from the 
same sample, Wolter (1985, chapter 5) proposes the model 


$$V/M^2 = \theta_0 + \theta_1/M,$$


where M and V denote the expected value and variance of 
the estimator respectively. Such a form might be suitable for 
variables such as income or wealth for which a nearly 
constant coefficient of variation might be plausible because 
the mean and standard deviation are proportional to the 
length of the reference period. Modeling the coefficient of 
variation is thus most suited to situations where the 
variables are similar in content but have different scales 
with unrestricted ranges (e.g., income collected monthly and 
yearly). In our problem the items are ordinal and so a model 
of the coefficient of variation is not a natural choice. Other 
proposed GVFs also have simple forms (Woodruff 1992; 
Otto and Bell 1995). 

If a suitable GVF can be found, it can simplify 
calculations and make variance estimates more stable. 
Furthermore, summarizing sampling variance estimates in 
the form of a function also facilitates presentation of large 
volumes of statistics (Wolter 1985, pages 201-202). Finally, 
modeling variances as functions of means facilitates 
iterative re-estimation of sampling variances in hierarchical 
modeling. In practice the decision to use variance functions 
in a hierarchical modeling context depends on the goodness 
of the fit of the GVF; only with a sufficiently good fit is use 
of the GVF worthwhile. 
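
A relative-variance GVF of the kind described above (relative variance linear in the reciprocal of the mean) can be fitted by regression. The sketch below is illustrative: unweighted least squares is chosen here only for brevity and is not claimed to be the fitting method of any cited author, and the function names and coefficient labels are hypothetical.

```python
import numpy as np

def fit_relvariance_gvf(means, variances):
    """Least-squares fit of V / M^2 = c0 + c1 / M; returns (c0, c1)."""
    M = np.asarray(means, dtype=float)
    V = np.asarray(variances, dtype=float)
    design = np.column_stack([np.ones_like(M), 1.0 / M])
    coef, *_ = np.linalg.lstsq(design, V / M ** 2, rcond=None)
    return coef

def predict_variance(coef, means):
    """Smoothed variance implied by the fitted function at the given means."""
    M = np.asarray(means, dtype=float)
    return M ** 2 * (coef[0] + coef[1] / M)

# Noise-free illustration: variances generated from known coefficients
M_demo = np.array([1.0, 2.0, 4.0, 8.0])
V_demo = M_demo ** 2 * (0.01 + 0.5 / M_demo)
coef_demo = fit_relvariance_gvf(M_demo, V_demo)
```

The fitted function replaces many noisy direct variance estimates with a two-parameter curve, which is where the gain in stability comes from.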

Past work on GVFs is relatively sparse. Wolter (1985, 
chapter 5) gave an overview but provided only a few 
references, as did Valliant, Dorfman and Royall (2000, 
pages 344-348). Valliant (1992a, 1992b) used GVFs to
smooth time-dependent indices in time series analysis. 
Woodruff (1992) used GVFs for variance estimation of 
employment change in the Current Employment Survey, 
and Wolter (1985, pages 208-217) illustrates the use of 
GVFs on data from the Current Population Survey. GVFs 
are also used in the National Health Interview Survey 
(Valliant et al. 2000, page 344). 

Huff, Eltinge, and Gershunskaya (2002) and Cho, 
Eltinge, Gershunskaya and Huff (2002) considered GVFs 
for the United States Current Employment Survey and 
Consumer Expenditure Survey. Eltinge (2002) uses GVFs 
to estimate a full sampling covariance matrix when samples 
are too small to produce stable estimates for all areas, 
estimating the components of the mean squared error (MSE) 
of the GVF model. Otto and Bell (1995) fit GVFs to median 
income, per capita income, and age-group poverty rates in 
the Current Population Survey, assuming an autoregressive 
dependence between rates over time and a Wishart 
distribution for the sampling covariance matrices. 

Our research extends previous research on GVFs in four
directions. First, we use the GVF to generalize across
domains rather than items. Thus, we do not assume that
different items have the same GVF, although it might be
reasonable to fit models of the same form for items with
similar response categories. Second, we develop GVFs for
the full covariance matrix, which must be estimated for joint
inference on multiple outcomes. Third, we focus on the
relationship between means and variances of items with the
ordinal response formats often used in survey questionnaires,
rather than on homoscedastic continuous responses.
Finally, we explicitly allow for patterns of nonresponse due 
to structured skip patterns. While structured item non- 
response can be ignored (except for its effect on sample 
size) in univariate estimation, it must be considered 
explicitly to model bivariate relationships because it affects 
the sampling covariance of item means. Furthermore, 
because the number of responses varies across items, we 
cannot model the sampling covariances using a Wishart 
distribution, which has only a single parameter for sample 
size. 

We first describe direct estimation of variances and 
covariances, including the case when data are missing due 
to skip patterns. In section 3 we introduce models for
generalized variance and covariance functions (GVCFs) and
lay out our strategies for model fitting and evaluation and
for combining direct estimates and model predictions. In 
section 4, we apply our methods to a major health care 
survey. In section 5, we conclude by describing applications 
and extensions of our methods. 
2. Direct Estimates of Sampling Variances of
Domain Means


We index observations by domain h, items (indices i and
j), and respondents (indices k and l); $y_{h,ik}$ and $r_{h,ik}$ denote
the outcome and response indicator of subject k in domain h
on item i. We suppress the index for item when referring to
all items for a respondent or domain, and have no need for
the subscript for respondent when discussing the means,
variances, and correlations of items.

Direct estimation of the sampling covariance matrix of
domain means (henceforth, “variance estimation”) begins
by expressing the means as functions of totals of the
outcomes and response indicators. We replace $y_{h,ik}$ with 0
for missing observations so that totals are defined in the
presence of skip patterns. Following the notation of Särndal,
Swensson and Wretman (1992, pages 24-28; 36-42), let $U_h$
and $S_h$ describe the population and sample respectively for
the $h$th domain, $Y_{h,i} = \sum_{U_h} y_{h,ik}$, $R_{h,i} = \sum_{U_h} r_{h,ik}$,
$\hat{Y}_{h,i} = \sum_{S_h} \check{y}_{h,ik}$, and $\hat{R}_{h,i} = \sum_{S_h} \check{r}_{h,ik}$,
where $\check{y}_{h,ik} = y_{h,ik}/\pi_{h,k}$,
$\check{r}_{h,ik} = r_{h,ik}/\pi_{h,k}$, and $\pi_{h,k} = \mathrm{pr}(k \in S_h)$.

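The vector of mean outcomes for the population of
elements within domain h is

$$M_h = f(Y_h, R_h) = \left(\frac{Y_{h,1}}{R_{h,1}}, \frac{Y_{h,2}}{R_{h,2}}, \ldots\right)^T,$$

where $Y_h = (Y_{h,1}, Y_{h,2}, \ldots)$ and $R_h = (R_{h,1}, R_{h,2}, \ldots)$. The corresponding
estimator is

$$\hat{M}_h = f(\hat{Y}_h, \hat{R}_h) = \left(\frac{\hat{Y}_{h,1}}{\hat{R}_{h,1}}, \frac{\hat{Y}_{h,2}}{\hat{R}_{h,2}}, \ldots\right)^T.$$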


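A first order Taylor series expansion of $f(\hat{Y}_h, \hat{R}_h)$ about
$f(Y_h, R_h)$ produces the approximation

$$\mathrm{var}\, f(\hat{Y}_h, \hat{R}_h) = V_h \approx f'(Y_h, R_h)\, \mathrm{var}(\hat{Y}_h, \hat{R}_h)\, f'(Y_h, R_h)^T,$$

where $f'(Y_h, R_h)$ is the Jacobian of $f(Y_h, R_h)$. Often it is
computationally easier to first calculate $u_{h,k} =
f'(Y_h, R_h) z_{h,k}$, where $z_{h,k} = (y_{h,k}, r_{h,k})$, and then evaluate
the variance as

$$V_h = \mathrm{var}\Big(\sum_{k \in U_h} I_{h,k}\check{u}_{h,k}\Big) = \sum_{k,l \in U_h} \Delta_{h,kl}\, \check{u}_{h,k}\check{u}_{h,l}^T,$$

where $\check{u}_{h,k} = u_{h,k}/\pi_{h,k}$, $I_{h,k} = 1$ if $k \in S_h$ (indicating that the $k$th member of
domain h is sampled) and 0 otherwise, $\Delta_{h,kl} =
\pi_{h,kl} - \pi_{h,k}\pi_{h,l}$ and $\pi_{h,kl} = \mathrm{pr}(k, l \in S_h)$. An estimator
for $V_h$ is

$$\hat{V}_h = \sum_{k,l \in S_h} \check{\Delta}_{h,kl}\, \check{u}_{h,k}\check{u}_{h,l}^T, \tag{1}$$

where $\check{\Delta}_{h,kl} = \Delta_{h,kl}/\pi_{h,kl}$.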

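To describe evaluation of $\hat{V}_h$ one need only consider one
diagonal (i.e., variance) element and one off-diagonal (i.e.,
covariance) element. The sub-matrix of the Jacobian formed
by the $i$th and $j$th items is given by

$$f'_{ij}(Y_h, R_h) = \begin{pmatrix} \dfrac{1}{R_{h,i}} & 0 & -\dfrac{Y_{h,i}}{R_{h,i}^2} & 0 \\ 0 & \dfrac{1}{R_{h,j}} & 0 & -\dfrac{Y_{h,j}}{R_{h,j}^2} \end{pmatrix}.$$

For $z_{h,k} = (y_{h,ik}, y_{h,jk}, r_{h,ik}, r_{h,jk})^T$, it follows that

$$u_{h,k} = f'_{ij}(Y_h, R_h)\, z_{h,k} = \begin{pmatrix} (y_{h,ik} - M_{h,i}r_{h,ik})/R_{h,i} \\ (y_{h,jk} - M_{h,j}r_{h,jk})/R_{h,j} \end{pmatrix},$$

where $M_{h,i} = Y_{h,i}/R_{h,i}$ is the mean outcome of the $i$th item
in domain h. Hence,

$$\hat{V}_{h,ii} = \frac{1}{R_{h,i}^2} \sum_{k,l \in S_h} \check{\Delta}_{h,kl}(\check{y}_{h,ik} - M_{h,i}\check{r}_{h,ik})(\check{y}_{h,il} - M_{h,i}\check{r}_{h,il}) \tag{2}$$

and

$$\hat{V}_{h,ij} = \frac{1}{R_{h,i}R_{h,j}} \sum_{k,l \in S_h} \check{\Delta}_{h,kl}(\check{y}_{h,ik} - M_{h,i}\check{r}_{h,ik})(\check{y}_{h,jl} - M_{h,j}\check{r}_{h,jl}). \tag{3}$$

To evaluate (2) and (3), we make a further approximation
by substituting $\hat{R}_{h,i} = \sum_{S_h} \check{r}_{h,ik}$ and
$\hat{M}_{h,i} = \sum_{S_h} \check{y}_{h,ik} / \sum_{S_h} \check{r}_{h,ik}$ for $R_{h,i}$ and $M_{h,i}$.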

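When sampling rates are small, or if we wish to make
predictions for a large super-population (e.g., all potential
enrollees in a health plan, not just those currently enrolled),
then $\pi_{h,kl} \approx \pi_{h,k}\pi_{h,l}$, so that $\Delta_{h,kl} \approx 0$ for $k \neq l$, and the
sampling design approaches sampling with replacement.
Under the sampling with replacement design, approximately
unbiased estimators are

$$\hat{V}_{h,ii} = \frac{1}{\hat{R}_{h,i}^2} \sum_{k \in S_h} (\check{y}_{h,ik} - \hat{M}_{h,i}\check{r}_{h,ik})^2 \tag{4}$$

and

$$\hat{V}_{h,ij} = \frac{1}{\hat{R}_{h,i}\hat{R}_{h,j}} \sum_{k \in S_h} (\check{y}_{h,ik} - \hat{M}_{h,i}\check{r}_{h,ik})(\check{y}_{h,jk} - \hat{M}_{h,j}\check{r}_{h,jk}). \tag{5}$$

These estimators can be generalized to accommodate
clustering.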
With equal-probability sampling within domains, (4) and
(5) reduce to

$$\hat{V}_{h,ii} = \frac{1}{(R^S_{h,i})^2} \sum_{k \in S_h} (y_{h,ik} - \hat{M}_{h,i}r_{h,ik})^2 \tag{6}$$

and

$$\hat{V}_{h,ij} = \frac{1}{R^S_{h,i}R^S_{h,j}} \sum_{k \in S_h} (y_{h,ik} - \hat{M}_{h,i}r_{h,ik})(y_{h,jk} - \hat{M}_{h,j}r_{h,jk}), \tag{7}$$

where $R^S_{h,i}$ is the number of respondents to item i in
domain h.
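
For the equal-probability case, the direct estimates can be computed in a few lines once skipped items are coded as zero outcome with zero response indicator. The sketch below is an illustration under that reading of the estimators (sum of products of residuals divided by the product of the item respondent counts); the function name and the use of `np.nan` to mark skipped items are assumptions of the example.

```python
import numpy as np

def direct_covariance(responses):
    """Direct variance-covariance estimate of item means for one domain.

    responses : array (subjects x items); np.nan marks an item skipped by a subject.
    Returns (item means, covariance matrix of the item means).
    """
    Y = np.asarray(responses, dtype=float)
    r = (~np.isnan(Y)).astype(float)      # response indicators
    y = np.where(np.isnan(Y), 0.0, Y)     # missing outcomes replaced by 0
    R = r.sum(axis=0)                     # number of respondents per item
    M = y.sum(axis=0) / R                 # item means over respondents
    resid = y - M * r                     # zero wherever the item was skipped
    V = (resid.T @ resid) / np.outer(R, R)
    return M, V

M_demo, V_demo = direct_covariance([[1.0, 2.0],
                                    [3.0, np.nan],
                                    [5.0, 4.0]])
```

Only subjects who answered both items contribute to an off-diagonal element, which is why structured skips matter for covariances even when they can be ignored for a single item.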


3. Models for Variance Functions 


In this section we propose specifications for models for 
variances and for sample correlations with complete 
responses or with structured skipped responses. We then 
discuss model fitting and evaluation strategies. We assume 
that these domains are nonoverlapping strata, so the 
sampling errors for different domains are independent. 

We transform the ordinal ratings to the [0,1] interval by
the transformation $p_{h,i} = (B_{h,i} - M_{h,i})/(B_{h,i} - A_{h,i})$, where
$A_{h,i}$ and $B_{h,i}$ are the minimum and maximum response
categories for item i in domain h respectively. We focus on
modeling variances for large values of $M_{h,i}$ (small values
of $p_{h,i}$) because in our motivating example mean outcomes
are typically near the high end of the scale.


3.1 Variance Functions 


To account for the variable number of respondents over
domains and items, and differing scales, we normalize the
variance estimators in (6) for sample size and re-scale:

$$\tilde{V}_{h,ii} = \frac{R^S_{h,i}\, \hat{V}_{h,ii}}{(B_{h,i} - A_{h,i})^2}.$$


With unequal probability sampling within domains, a 
normalization factor could be used that accounts for the 
aati One ‘saa normalization is to multiply V, ;, by 

=F, Rye i, >)» Where Tr, ix 4S the response 
cis for item i for the k" subject in the h™ domain, in 
place of ee This approximation, proposed in Kish 
(1965), has a model based justification (Gabler, Haeder and 
Lahiri 1999). It works well if the sampling probabilities 
vary modestly in the sample, but can lead to inefficiency if 
the variation is excessive (Korn and Graubard 1999, page 
173; Spencer 2000). 

Because the items in our example have ordinal scales, the
variance must go to 0 as $p_{h,i} \to 0$ or $p_{h,i} \to 1$. An obvious
predictor with this property is the variance function of the
Bernoulli distribution, $p_{h,i}(1 - p_{h,i})$. This holds exactly for
dichotomous items, and might be a useful approximation for
items with three or more categories.

As alternatives to the Bernoulli variance model we 
considered models with a variety of polynomial and other 
functions of the means as predictors. Of all the models 
considered, the quadratic family of models was found to fit 
as well as any. We focused on the following quadratic 
models. 


Model V1: $\tilde V_{h,ii} = \beta_1 p_{h,i}$, (8) 
Model V2: $\tilde V_{h,ii} = \beta_2 p_{h,i}(1 - p_{h,i})$, (9) 
Model V3: $\tilde V_{h,ii} = \beta_1 p_{h,i} + \beta_2 p_{h,i}(1 - p_{h,i})$. (10) 


Thus we consider a linear variance model V1, a binomial-like 
model V2, and a general quadratic variance model V3. 
All models correctly ensure $\tilde V_{h,ii} = 0$ when $p_{h,i} = 0$, but 
only V2 ensures that $\tilde V_{h,ii} = 0$ when $p_{h,i} = 1$. The rationale 
behind V1 is that relationships are often approximately 
linear over small intervals. Both V1 and V2 are submodels 
of the two-parameter quadratic V3. We also considered 
models for $\log(\tilde V_{h,ii})$, but these models did not fit as well. 
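
To make the three variance functions concrete, the following Python sketch evaluates them and fits them by weighted least squares without an intercept. It is an illustration only: the function names, the closed-form solver for at most two coefficients, and the weights are assumptions made here, and the iterative reweighting described in Section 3.4 is not included.

```python
def design_row(model, p):
    """Predictors of the variance function for a rescaled mean p."""
    if model == "V1":
        return [p]
    if model == "V2":
        return [p * (1.0 - p)]
    if model == "V3":
        return [p, p * (1.0 - p)]
    raise ValueError(f"unknown model {model!r}")


def fit_variance_function(model, p_values, v_values, weights):
    """Weighted least squares through the origin; returns the coefficients."""
    rows = [design_row(model, p) for p in p_values]
    k = len(rows[0])
    # Normal equations (X'WX) beta = X'Wv for k = 1 or k = 2 predictors.
    xtwx = [[sum(w * r[a] * r[b] for r, w in zip(rows, weights))
             for b in range(k)] for a in range(k)]
    xtwv = [sum(w * r[a] * v for r, v, w in zip(rows, v_values, weights))
            for a in range(k)]
    if k == 1:
        return [xtwv[0] / xtwx[0][0]]
    det = xtwx[0][0] * xtwx[1][1] - xtwx[0][1] * xtwx[1][0]
    if abs(det) < 1e-15:
        raise ValueError("predictors are collinear")
    # Cramer's rule for the 2 x 2 system.
    b1 = (xtwv[0] * xtwx[1][1] - xtwx[0][1] * xtwv[1]) / det
    b2 = (xtwx[0][0] * xtwv[1] - xtwx[1][0] * xtwv[0]) / det
    return [b1, b2]


def predict_variance(model, coefs, p):
    """Evaluate the fitted variance function at a rescaled mean p."""
    return sum(c * x for c, x in zip(coefs, design_row(model, p)))
```

Because no intercept is included, every fitted curve passes through zero at p = 0, which is the constraint shared by V1, V2 and V3.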

The model V3 is equivalent to the model suggested by 
Wolter (1985, chapter 5); the equivalence is seen by 
expressing the right-hand side of V3 in terms of $p_{h,i}$ and 
$p_{h,i}^2$, and then dividing both sides by $p_{h,i}^2$ to obtain the 
relative variance. However, parameter estimates obtained by 
fitting the two forms of the model may be different 
depending on the modeling assumptions used. 
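
The algebra behind this equivalence can be written out in two lines. The step below is a sketch that drops the domain and item subscripts for readability and assumes the relative variance is the variance divided by the squared rescaled mean; the symbols $a$ and $b$ are introduced here only to label the resulting coefficients.

```latex
\begin{align*}
V &= \beta_1 p + \beta_2 p(1-p) = (\beta_1 + \beta_2)\,p - \beta_2\,p^2, \\
\frac{V}{p^2} &= -\beta_2 + \frac{\beta_1 + \beta_2}{p} = a + \frac{b}{p},
\qquad a = -\beta_2, \quad b = \beta_1 + \beta_2 .
\end{align*}
```

The relative variance is thus linear in $1/p$, and the two parameterizations are one-to-one.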


3.2 Correlation Functions with Complete Data 


Because correlations are independent of the scale of the 
data, we model the correlations and derive the sampling 
covariances, rather than modeling the covariances directly. 
We model the sample correlations 

$$\hat\rho_{h,ij} = \frac{\hat V_{h,ij}}{\sqrt{\hat V_{h,ii}\,\hat V_{h,jj}}}$$

via the unrestricted transformed values $Z_{h,ij} = 
\log\{(1 + \hat\rho_{h,ij})/(1 - \hat\rho_{h,ij})\}$. Unlike the variance models, 
models for correlations may include an unrestricted 
intercept, since there is no natural restriction on the 
correlation when $p_{h,i}$ or $p_{h,j}$ approaches 0 or 1. 

Because $\hat\rho_{h,ij}$ is a function of the first and second 
moments of items $i$ and $j$, it seemed reasonable to first focus 
on linear and quadratic models for $Z_{h,ij}$. As with variance 
functions, we found that a more extensive range of models 
(e.g., models with logarithms of the means as predictors) did 
not substantially improve model fit. We ultimately focused 
on the following nested series of models. 

Model C1: $Z_{h,ij} = \alpha_{0ij}$, (11) 
Model C2: $Z_{h,ij} = \alpha_{0ij} + \alpha_{3ij}\, p_{h,i}\, p_{h,j}$, (12) 
Model C3: $Z_{h,ij} = \alpha_{0ij} + \alpha_{1ij}\,(p_{h,i} + p_{h,j}) + \alpha_{3ij}\, p_{h,i}\, p_{h,j}$, (13) 
Model C4: $Z_{h,ij} = \alpha_{0ij} + \alpha_{1ij}\, p_{h,i} + \alpha_{2ij}\, p_{h,j} + \alpha_{3ij}\, p_{h,i}\, p_{h,j}$, (14) 
Model C5: $Z_{h,ij} = \alpha_{0ij} + \alpha_{1ij}\, p_{h,i} + \alpha_{2ij}\, p_{h,j} + \alpha_{3ij}\, p_{h,i}\, p_{h,j} + \alpha_{4ij}\, p_{h,i}^2 + \alpha_{5ij}\, p_{h,j}^2$. (15) 

Model C3 is model C4 with the constraint $\alpha_{1ij} = \alpha_{2ij}$. 
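
The nesting of these models is easiest to see from their predictor vectors. The Python sketch below is illustrative only: the function names are hypothetical, and the form used for C2 (an intercept plus the product of the two rescaled means) is an assumption, since equation (12) is only partly legible in the source.

```python
import math


def z_transform(rho):
    """Z = log((1 + rho) / (1 - rho)), mapping (-1, 1) onto the real line."""
    if not -1.0 < rho < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return math.log((1.0 + rho) / (1.0 - rho))


def correlation_design_row(model, p_i, p_j):
    """Predictors (including the intercept) for models C1 to C5."""
    rows = {
        "C1": [1.0],
        "C2": [1.0, p_i * p_j],  # assumed form of C2
        "C3": [1.0, p_i + p_j, p_i * p_j],
        "C4": [1.0, p_i, p_j, p_i * p_j],
        "C5": [1.0, p_i, p_j, p_i * p_j, p_i ** 2, p_j ** 2],
    }
    if model not in rows:
        raise ValueError(f"unknown model {model!r}")
    return rows[model]


def predict_z(model, coefs, p_i, p_j):
    """Linear predictor for the transformed correlation."""
    row = correlation_design_row(model, p_i, p_j)
    if len(coefs) != len(row):
        raise ValueError("coefficient vector has the wrong length")
    return sum(c * x for c, x in zip(coefs, row))
```

Setting the two linear coefficients of C4 equal reproduces C3, which is the constraint stated above.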


3.3 Predicting Covariances with Structured Missing Data 


When the data have skip patterns, the sample correlations 
of the ratings for the set of respondents who answered both 
items can be modeled by (11)-(15), as in the complete 
response case. The corresponding sample covariances can 
be easily estimated by using the fitted variance functions to 
re-scale the predicted correlations. However, because the 
sampling covariance reflects the variability in the whole 
sampling process, not just the variability within the sub- 
population of respondents who answered both items, the 
relationship between sample covariance and sampling 
covariance is more complicated than if the data were 
complete. In this section we derive the relationship between 
the sample covariance for the set of respondents who 
answered both items and the sampling covariance. This 
allows correlation models such as (11)-(15) to be applied to 
data with skip patterns. 
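
The re-scaling of a predicted correlation into a covariance can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the covariance among respondents who answered both items; the additional adjustment for skip patterns derived in this section is not included.

```python
import math


def inverse_z(z):
    """Invert Z = log((1 + rho) / (1 - rho)); algebraically this is tanh(z / 2)."""
    return math.tanh(z / 2.0)


def predicted_covariance(z_pred, var_i, var_j):
    """Re-scale a predicted transformed correlation by two fitted variances."""
    if var_i < 0 or var_j < 0:
        raise ValueError("variances must be non-negative")
    return inverse_z(z_pred) * math.sqrt(var_i * var_j)
```

For instance, a predicted Z of log 3 corresponds to a correlation of 0.5, so variances of 0.04 and 0.09 give a covariance of 0.5 x 0.2 x 0.3 = 0.03.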

There are four distinct data patterns for any pair of items: 
response to both items, one response and one skipped item 
(two patterns), and both items skipped. We extend our 
notation by introducing a superscript representing the 
response pattern of a second item. Let 
$Y^{1}_{h,ij} = \sum_{s} y_{h,ik}\, r_{h,ik}\, r_{h,jk}$, 
$R^{1}_{h,ij} = \sum_{s} r_{h,ik}\, r_{h,jk}$, 
$Y^{0}_{h,ij} = \sum_{s} y_{h,ik}\, r_{h,ik}\,(1 - r_{h,jk})$, 
$R^{0}_{h,ij} = \sum_{s} r_{h,ik}\,(1 - r_{h,jk})$, 
$M^{1}_{h,ij} = Y^{1}_{h,ij}/R^{1}_{h,ij}$ and $M^{0}_{h,ij} = Y^{0}_{h,ij}/R^{0}_{h,ij}$. 
Then 

$$M_{h,i} = \frac{R^{1}_{h,ij}\, M^{1}_{h,ij} + R^{0}_{h,ij}\, M^{0}_{h,ij}}{R_{h,i}}, \qquad R_{h,i} = R^{1}_{h,ij} + R^{0}_{h,ij}.$$

In the equal probability sampling case, substitution of the 
above expression for $M_{h,i}$ into (7) yields 

[Equation (16): display illegible in source] 

where $D_{h,ij} = M^{1}_{h,ij} - M^{0}_{h,ij}$. Here 
$C_{h,ij} = \sum_{s} (y_{h,ik} - M^{1}_{h,ij})\, r_{h,ik}\, (y_{h,jk} - M^{1}_{h,ji})\, r_{h,jk} / R^{1}_{h,ij}$ is the normalized 
sample covariance of the ratings for the set of respondents 
who answered both items (which can be predicted using 
correlation and variance functions, and in the case of 
unequal probability sampling applying a normalization 
factor). When the sampling probabilities are not equal, 
Equation (16) holds exactly only if a condition on the 
residuals is satisfied [condition illegible in source]. Therefore, (16) may be expected to 
provide a good approximation if the sampling probabilities 
for one item are not highly correlated with the residuals for 
another item. In general, the appropriateness of using (16) 
for unequal probability sampling designs should be 
checked. 

The estimated mean differences $D_{h,ij}$ determine the 
contribution of the response pattern to the sampling 
covariance. Either $D_{h,ij}$ or $D_{h,ij} D_{h,ji}$ may be modeled in 
the process of obtaining smoothed estimates of $V_{h,ij}$. In our 
application, the $D_{h,ij}$ were typically small. Because the 
second term of (16) is a product of two factors of small 
magnitude ($D_{h,ij}$ and $D_{h,ji}$), the contribution of $D_{h,ij}$ to 
(16) was small and it sufficed to use a simple model for 
$D_{h,ij}$, such as a constant for each item pair. However, 
unique constants should be estimated for each pair of items. 


3.4 Model Fitting and Evaluation 




We estimate the parameters of the variance or correlation 
function using iteratively reweighted least squares 
regression. Weighting is important when the number of 
responses varies greatly across domains, as in our 
motivating example. 

In this section we index domains ($h$) and respondents ($k$) 
but not items, as the same methodology applies to each 
variance and correlation model. Exact computations are 
derived for the equal probability sampling case, and 
approximations are noted for the unequal probability 
sampling case. Generically, the direct estimators $\hat f_h$, true 
values $f_h$ and model predictions $\tilde f_h$ are related through 
the hierarchical model 


Level I: $\hat f_h = f_h + \varepsilon_h$, (17) 
Level II: $f_h = \tilde f_h + e_h$, (18) 


where $\varepsilon_h \sim [0, \sigma_h^2/R_h]$, $e_h \sim [0, \tau^2]$, and $[\mu, \sigma^2]$ denotes 
a distribution with expectation $\mu$ and variance $\sigma^2$ but 
unspecified form. In the unequal probability sampling case we 
replace $R_h$ with the normalization factor of Section 3.1. Here $\varepsilon_h$ represents sampling error 
and $e_h$ represents model error. Marginally, $\hat f_h = \tilde f_h + 
e_h + \varepsilon_h$, so in the regression we weight the observation for 
domain $h$ by $w_h = (\tau^2 + \sigma_h^2/R_h)^{-1}$, the inverse of the 
marginal variance. With equal-probability sampling, the 
estimate $\hat\sigma_h^2$ of the sampling variance component is given 
by 

[Equation (19), for the case where $f$ is a variance: display illegible in source] 

[Equation (20), for the case where $f$ is a transformed correlation: display illegible in source] 

In the equal probability sampling case Equation (19) is exact 
and does not depend on parametric assumptions (Seber 
1977, page 14). The asymptotic approximation (20) to the 
variance of the transformed correlation $Z_h$ (Freund and 
Walpole 1987, page 477) deteriorates as sample sizes 
decrease, and fails altogether for $R_h \le 3$. However, 
domains with small sample sizes have little impact on the 
fitted models; we exclude domains with $R_h < 5$ from 
correlation modeling. 

When the sampling probabilities are not equal, the large 
sample counterpart to (19), given by 

[display illegible in source] 

where $w_h$ is defined by [expression illegible in source], may be used. In 
the equal probability sampling case, $w_h = 0$ and the above 
expression reduces to a non-bias corrected version of (19). 
If the sampling probabilities are not equal, we suggest 
replacing (20) with the design-effect-corrected estimator 

[display illegible in source] 

The model error variance $\tau^2$ is estimated as: 

[display illegible in source] 

where $\mathrm{MSE} = \sum_h q_h (\hat f_h - \tilde f_h)^2$, $q_h = N R_h / \sum_h R_h$, and 
$N = \sum_h 1(R_h > 0)$. The weights are then re-estimated as 
$w_h = (\hat\tau^2 + \hat\sigma_h^2/R_h)^{-1}$, and the GVCF models are refit, 
iterating to convergence. We again suggest replacing $R_h$ 
with the normalization factor of Section 3.1 if the sampling probabilities are not equal. 
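
The iteration can be sketched in Python for a one-predictor model through the origin, such as V1. This is a minimal sketch under stated assumptions, not the estimator used in the paper: the function name is hypothetical, the averaging weights are normalized to sum to one, and the model error variance is updated with a simple moment-type rule (weighted mean squared error minus the weighted average sampling variance, truncated at zero) that is assumed here because the paper's display is not legible in the source.

```python
def irls_through_origin(x, f_hat, samp_var, n_resp, tol=1e-10, max_iter=50):
    """Iteratively reweighted least squares for f_hat ~ beta * x.

    x        : predictor for each domain (e.g. the rescaled mean p)
    f_hat    : direct estimate for each domain
    samp_var : sampling variance of each direct estimate (must be positive)
    n_resp   : number of respondents in each domain
    Returns (beta, tau2, iterations_used).
    """
    if any(s <= 0 for s in samp_var):
        raise ValueError("sampling variances must be positive")
    total = float(sum(n_resp))
    q = [r / total for r in n_resp]  # assumed: weights normalized to sum to 1
    tau2 = 0.0
    beta = 0.0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Weight each domain by the inverse of its marginal variance.
        w = [1.0 / (tau2 + s) for s in samp_var]
        beta = (sum(wi * xi * fi for wi, xi, fi in zip(w, x, f_hat))
                / sum(wi * xi * xi for wi, xi in zip(w, x)))
        mse = sum(qi * (fi - beta * xi) ** 2 for qi, xi, fi in zip(q, x, f_hat))
        # Assumed moment-type update for the model error variance.
        new_tau2 = max(0.0, mse - sum(qi * si for qi, si in zip(q, samp_var)))
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break
    return beta, tau2, iteration
```

Because the weights depend on the data only through the current estimate of the model error variance, the loop typically stabilizes after very few passes.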

We compared the predictive accuracy of models using 
$R^2 = 1 - \mathrm{MSE}/\mathrm{MSV}$, where MSE is the mean squared 
error of the regression, and MSV is the sample size 
weighted average of the sampling variances of the direct 
estimators (variances or transformed correlations) for each 
domain. Note that we could have $R^2 < 0$ for a very poorly 
fitting model. 


3.5 Combined Estimators 


For domains with small samples, direct survey variance 
estimates often are too imprecise to be useful, while 
estimates for larger domains in the same study may be quite 
reliable. Fay and Herriot (1979) and Ghosh and Rao (1994) 
demonstrated that shrinking direct estimates towards a 
model-based smoothed value can lead to substantial gains in 
precision. They proposed composite or empirical Bayes 
estimators that are weighted averages of direct and model- 
based estimators. That is, instead of either using the direct 
estimates or estimates obtained from generalized variance/ 
covariance modeling, we use a weighted average of the two 
estimators to potentially obtain even better estimates. 

Such weighted estimators can be constructed for domain 
variances using the model specified in (17) and (18). A 
natural approach is to weight the direct and model-based 
estimators inversely proportional to the corresponding 
sampling and model error variances respectively (denoted 
$\sigma_h^2$ and $\tau^2$ respectively for domain $h$). The resulting 
estimator for domain $h$ (for variances and transformed 
correlations) is: 

$$\hat f_h^{\mathrm{comb}} = \frac{\tau^2\, \hat f_h^{\mathrm{dir}} + \sigma_h^2\, \hat f_h^{\mathrm{mod}}}{\tau^2 + \sigma_h^2} = \hat f_h^{\mathrm{dir}} + \frac{\sigma_h^2}{\tau^2 + \sigma_h^2}\,\bigl(\hat f_h^{\mathrm{mod}} - \hat f_h^{\mathrm{dir}}\bigr),$$

where $\hat f_h^{\mathrm{dir}}$ and $\hat f_h^{\mathrm{mod}}$ denote the direct and model-based 
estimators. This generic formula applies to the variance 
estimates for all items, and correlation estimates for all pairs 
of items. The right-most expression has the form of an 
empirical Bayes estimator. 

If the direct and model-based variance estimators are 
independent, the variance of the resulting combined 
estimator is $\tau^2 \sigma_h^2/(\tau^2 + \sigma_h^2) < \min\{\tau^2, \sigma_h^2\}$. Thus the 
composite is at least as precise as either of its two 
component estimators, improving on ad hoc selection 
between direct and model-based predictions. This is a useful 
strategy especially when model-based predictions improve 
on direct estimates for some, but not all domains. 
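
The composite estimator and its variance can be computed directly from the two variance components. The following Python sketch uses hypothetical names and made-up inputs; it assumes, as in the text, that the direct and model-based estimators are independent.

```python
def composite_estimate(f_dir, f_mod, sampling_var, model_var):
    """Combine a direct and a model-based estimate.

    The model-based estimate receives weight sampling_var / (sampling_var +
    model_var), so noisy direct estimates are shrunk more strongly.
    Returns (estimate, weight_on_model, variance_of_composite).
    """
    total = sampling_var + model_var
    if sampling_var < 0 or model_var < 0 or total <= 0:
        raise ValueError("variance components must be non-negative, not both zero")
    weight_mod = sampling_var / total
    estimate = f_dir + weight_mod * (f_mod - f_dir)
    variance = sampling_var * model_var / total
    return estimate, weight_mod, variance


# Illustrative inputs: direct estimate 1.0 with sampling variance 3.0,
# model-based estimate 2.0 with model error variance 1.0.
estimate, weight_mod, variance = composite_estimate(1.0, 2.0, 3.0, 1.0)
```

With these inputs the model-based estimate receives weight 0.75, the composite is 1.75, and its variance of 0.75 is below both component variances.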


4. Example: CAHPS® Data Set 


The Consumer Assessments of Health Plans Study 
(CAHPS®) survey (Goldstein, Cleary, Langwell, Zaslavsky 
and Heller 2001) was designed primarily to elicit consumer 
ratings and reports on health plans. Plan mean scores 
(perhaps after recoding) on the various survey items are 
calculated and reported to consumers, health plans, and 
purchasers. Each analytic domain consists of the enrollees 
of a health plan (or geographically defined portion of one) 
in a year; most of the plans are sampled in multiple years. 
The stratum is the reporting unit (plan or portion thereof) in 
a given year; reporting units corresponded to plans with the 
exception of a few large plans that had multiple reporting 
units. Therefore, there are many units for variance and 
covariance function estimation. 

We illustrate our methods with a CAHPS data set for 
beneficiaries of U.S. Medicare managed care plans, a 
system of private but government-funded entities serving 
from 5.7 to 6.9 million elderly or disabled beneficiaries in 
each year during our study period (1997 to 2001). Our data 
represent 381 reporting domains each sampled in 1 to 5 
years for a total of 932 distinct reporting unit by year 
domains with 705,848 responses. Because samples are 
drawn independently each year, patients may be sampled in 
multiple years. However, repeated sampling is rare and can 
be overlooked for our analysis. Therefore, the domains are 
strata with equal probability element sampling performed 
within each. Note that in CAHPS analyses no corrections 
are made for finite-population sampling since the data are 
collected to guide choices for future years rather than to 
record experiences of the specific population in a particular 
year. 

CAHPS items use a variety of ordinal response formats 
with either 11, 4, 3, or 2 response options. Overall ratings of 
doctor, specialist, care, and plan are measured on a 0 to 10 
scale from “worst possible” to “best possible”. Other items 
use a 4-point ordinal “frequency” scale (never/sometimes/ 
usually/always), or a 3-point ordinal “problem” scale (not a 
problem/somewhat a problem/a big problem), or are 
dichotomous (no/yes). Many items are answered only by 
respondents who used particular services or had particular 
needs, as determined by screener items. For example, an 
item about whether advice was obtained successfully by 
telephone is only answered by those who first reported that 
they attempted to obtain advice in that way. 


4.1 Descriptive Statistics 


Table 1 presents response distributions and domain mean 
distributions by item type. Missing observations due to 
structured skip patterns often occurred in blocks, with as 
many as 11 items skipped on the basis of a single screening 
question. Very little nonresponse (less than 2% on almost all 
items) was not due to a structured skip pattern. In this 
analysis we treat all types of nonresponse identically. 

Item response rates were lowest (as low as 4%) for 
problem items, several of which dealt with specialty 
services such as therapy or home health care needed by 
relatively few respondents. Some of the frequency and 
yes/no items also had low response rates. The greatest 
variation in the proportions of skipped items was evident 
among the yes/no items: 96.7% for a “complaint or problem 
with plan” to 12.5% for “get prescription through plan”. 
Domain mean outcomes are in general concentrated towards 
the higher end of their scales, indicating that most responses 
were favorable. 


Table 1 
Distribution of Responses and Ratings Evaluated over Items 
of the Same Type (n = 705,848 Respondents) 


Statistic                 Numerical     How Often     Problem       Yes/No 
Number of items               4            11           11             5 
Percentage responding 
  Mean                      74.97        62.56        30.32       [illegible] 
  Minimum                   50.90        27.70         4.00         12.50 
  Maximum                   95.00        74.50        64.40         96.70 
Item means 
  Mean                       8.76     [illegible]      2.70          1.78 
  Minimum                    8.57         3.09         2.49          1.62 
  Maximum                    8.88         3.84         2.86       [illegible] 
Distribution of ratings (across items in group) 
   0                     [illegible] 
   1                         0.4          2.0      [illegible]    [illegible] 
   2                         0.4          6.3         12.4          80.5 
   3                         0.7         23.9         82.2 
   4                         0.9         67.8 
   5                         4.6 
   6                         3.0 
   7                     [illegible] 
   8                        16.1 
   9                        17.8 
  10                        49.5 


Items are on a 0-10 numerical scale from “worst possible” to “best 
possible”, a 4-point 1-4 ordinal “frequency” scale (never/sometimes/ 
usually/always), a 3-point 1-3 ordinal “problem” scale (not a 
problem/somewhat a problem/a big problem), or are dichotomous 1-2 
items (no/yes). 

The domain mean, minimum, and maximum values 
across all items of the same type are also presented in Table 
1. These illustrate that the 0-10 items have the smallest total 
variation (after rescaling to the common 0-1 range), while 
the 1—2 items have the largest total variation across domains 
and items. This is also illustrated in Figure 1, where we 
observe that the distribution of the 1-2 items varies sub- 
stantially across items whereas the distributions of the 0-10 
items are more homogeneous. 

Table 2 presents statistical summary measures for the 
means and standard deviations of the domain mean ratings, 
evaluated across items of the same type. This complements 
Figure 1 by summarizing the difference in distributions of 
items within a given scale. Items with more response 
categories are concentrated towards the top of the scale and 
hence have smaller variance. For example, the mean 
standard deviation of the 1-2 items (0.36) is twice that of 
the rescaled 0-10 items (0.172). With the exception of the 
0-10 items, the distributions of domain mean ratings vary 
greatly across items of the same type. For instance, the 
standard deviation of the means of 1-2 items across items is 
0.30 compared to a rescaled standard deviation of 0.03 for 
the 0-10 items. 

Figure 1. Five-point Summary of the Domain Sample Means 
for Each Item. The five-point summary consists of 
the minimum, 10th percentile, mean, 90th percentile, 
and the maximum. 


Table 2 
Summary Statistics of Domain Means and Standard Deviations 
Evaluated Over Domains and Items 


                              Summary Statistics for: 
Type                          Item Means                                Item SDs 
                   Min          Max          Mean         SD          Mean     SD 
Numerical 0-10     6.82         9.52         8.76         0.30        1.72     0.26 
Frequency 1-4   [illegible]  [illegible]  [illegible]     0.12        0.66     0.09 
Problem 1-3     [illegible]  [illegible]     2.70      [illegible]    0.57  [illegible] 
Yes/No 1-2         1.34         1.96      [illegible]  [illegible]    0.36     0.06 

Note: Columns 2 through 5 give the minimum, maximum, mean, 
and standard deviation of the domain item means across items 
of a given type. Columns 6 and 7 give the mean and standard 
deviation of the domain item standard deviations across items 
of a given type. 


Sample correlations also varied greatly across the pairs of 
items (Figure 2), although most were positive. Correlations 
between items of the same type most often were higher than 
those between items of different types. The numerical 0-10 
ratings had the largest correlations (mean = 0.49), and 
generally ratings with more categories tended to have higher 
correlations than ratings with fewer categories. Although 
most of the pairs of 1—4 items had mean correlations near to 
0.5, one item was negatively correlated with the others 
(revealed by the cluster of mean correlations below 0); this 
arose from reverse coding an item whose overall sample 
mean was not in the top half of the scale. The distributions 
of the correlations of pairs of 1-2 items were centered near 
0, indicating that pairs of items of this type often have 
negative correlations. Complete item wordings and 
additional summary statistics appear in Zaslavsky, Beaulieu, 
Landon and Cleary (2000) and Zaslavsky and Cleary 
(2002). 

Models fitted to the variances and correlations are 
presented in the remainder of this section. Extensive 
checking of the best-fitting models indicated that the 
residuals did not follow any discernible pattern. 


Figure 2. Five-point summary of the domain sample correlations 
between items with the same type (panels include numerical item 
pairs and frequency item pairs). The five-point summary 
consists of the minimum, 10th percentile, mean, 90th 
percentile, and the maximum. 


4.2 Variance Functions 


In preliminary investigations not reported here, we fit 
two models within groups of items with the same response 
scale, one with common and one with different regression 
parameters for each item, to the data set comprising all of 
the items. Comparisons of the overall fits of the models 
(using criteria such as Mallows' $C_p$ and adjusted $R^2$) and 
tests of the significance of effect-item interactions demonstrated 
that allowing parameters to vary across items significantly 
improved model fit. For instance, for the rescaled 
numerical ratings, weighted by domain sample size, the two 
models’ root mean squared errors were 0.446 versus 0.402, 
and values of $R^2$ were 0.783 versus 0.825. Based on this 
we decided to fit separate models for each item. 

The variance functions (8)-(10) were fitted to each item 
except the yes/no items, which follow the binomial variance 
function in the equal-probability sampling case. The 
iterative procedure described in section 3.4 converged 
almost exactly within two iterations. This is because 
the weights for the observations change only with the 
estimate of $\tau^2$, and so very little change in the weights 
occurs after the first iteration. 

Table 3 presents the average sampling variation, average 
model error variation, and $R^2$, for each model averaged 
over items of each response scale. Sampling variation, 
computed using (19), does not depend on the model. 


Table 3 
Goodness-of-fit Statistics for Variance Functions 

Rating Scale               0-10               1-4                1-3 
Sampling Variation        0.1460             0.3511             3.1703 

                      ModErr   R^2       ModErr   R^2       ModErr   R^2 
Model V1              0.020   0.741      0.066   0.824      0.069   0.916 
Model V2              0.043   0.710      0.036   0.835      0.000   0.940 
Model V3              0.016   0.750      0.024   0.847      0.000   0.947 

Prob(ModErr < Sampling Variation) 

Model V1                  0.968              0.916              0.996 
Model V2                  0.858              0.967              0.996 
Model V3                  0.981              0.983              0.996 

ModErr is the variance component for lack of fit, R^2 is as defined in 
section 3.4, Prob(ModErr < Sampling Variation) is the proportion of 
domains for which model error is smaller than sampling variation. All 
values are rescaled to a 0-1 scale, and model errors are multiplied by 
[factor missing in source]. 

For items with few categories (more closely resembling 
the binomial), the quadratic component of the variance 
function tends to dominate the linear component, making 
models V2 and V3 fit better than V1. Because V2 imposes a 
constraint at a point far outside the range of the domain 
means, it does not fit the data as well when there are more 
categories and the data are consequently further from 
binomial. The 0-10 items are less dispersed than the 1-4 
and 1-3 ratings, enabling the linear model to fit better. The 
$R^2$ values for model V3 were close to 0.75 for numerical 
(0-10) items, 0.85 for the frequency (1-4) items, and 0.95 
for the problem (1-3) items. 

The lower portion of Table 3 displays for each item the 
proportion of domains (of those with at least 2 responses to 
the given item) for which sampling variation is larger than 
model error variation. For over 90% of domains, model 
error variation was less than the sampling variation of the 
direct variance estimate. 

Figure 3 illustrates the fit of V3 for two each of the 0-10, 
1-4, and 1-3 items. Illustrations for the remaining items are 
similar, but are not provided due to space limitations. The 
fitted curves are constrained to 0 at the maximum ratings. 
To assess the impact this constraint has on the fitted 
variance function, we also fit an unrestricted (three-parameter) 
quadratic variance function; these attained values 
very close to 0 at the maximum rating, and closely approximated 
the fitted curve from the constrained models, further 
supporting V3. 

Average parameter estimates and their standard 
deviations over items of the same type are shown in Table 4. 
The parameters differed substantially across items, sup- 
porting the decision to estimate separate regression 
coefficients. In most cases the coefficients for both the $p_{h,i}$ 
and $p_{h,i}(1 - p_{h,i})$ terms in V3 were significant, indicating 
that these are needed for generalized variance modeling. In 
some cases (particularly with the 0-10 items) the coefficient 
of the $p_{h,i}(1 - p_{h,i})$ term was negative, resulting in an 
estimated variance function that is convex rather than 
concave (the shape of the binomial variance function). This 
can happen when the sample means for the ratings are 
concentrated on a small proportion of the response scale, 
over which the linear term explains much of the variation in 
the data. As mentioned earlier, adding higher-order polynomial 
or logarithmic functions of $p_{h,i}$ did not significantly 
improve model fit. 

Table 4 
Average Variance Function Parameter Estimates for Each Type of 
Item and Standard Deviations Across Items (in Parentheses) 


Model                            Item Type 
               0-10                  1-4                   1-3 
          beta_1    beta_2      beta_1    beta_2      beta_1    beta_2 
V1         0.236      -          0.354      -          0.569      - 
          (0.016)     -         (0.039)     -         (0.068)     - 
V2           -       0.271        -        0.421        -        0.711 
             -      (0.020)       -       (0.034)       -       (0.069) 
V3         0.334    -0.114       0.151     0.241       0.239     0.420 
          (0.143)   (0.155)     (0.104)   (0.132)     (0.112)   (0.110) 


See Table 1 for a description of the 0-10, 1-4, and 1-3 items. 


4.3 Correlation Functions 


Models are ordered from simplest (C1, the constant 
model) to most complex (C5, containing all linear and 
quadratic terms). As for the variance models, statistical tests 
found highly significant item interaction effects, implying 
that separate models should be fit for each pair. We did not 
expect all pairs of items to have similar correlations, since 
by intention the items are divided into internally consistent 
groups, each of which measures a distinct aspect of patient 
experiences such as interactions with doctor or dealings 
with customer service agents (Hays, Shaul, Williams, 
Lubalin, Harris-Kojetin, Sweeny and Cleary 1999). 


Figure 3. Quadratic Variance Function (V3) of Two Items for each Rating Type. 
Each point is the average of 60 domains. Vertical lines join the 10th and 
90th percentiles of the distribution of the variances. For this and 
following displays the direction of the transformed horizontal axis has 
been reversed to agree with that of the original variables. 

The fits of the correlation models for pairs of items of the 
same type are summarized in Table 5. Over the range of 
models considered, the biggest improvements in model 
performance (as measured by $R^2$) occur between model C1 
and model C2, and between model C3 and model C4. For 
example, the average $R^2$ for the numerical ratings in 
models C3-C5 are 0.0391, 0.1494, and 0.1508 respectively, 
and the average $R^2$ for the 1-4 ratings over C1-C3 are 0, 
0.0700, and 0.0789 respectively. This suggests that C2 and 
C4 are the best models for different pairs of items, a claim 
that is supported by the hypothesis tests on the significance 
of the incremental improvements in model fit. 

Sampling variation was highest for the 1-3 ratings, at 
least in part because high rates of non-response due to 
skipped responses diminished the sample sizes. Model error 
and $R^2$ of correlation models for items of different types 
were similar to those for models for items having the same 
type. 

The $R^2$ values for the correlation models were between 
0.029 and 0.15 for all pairs of items. Although there was no 
evidence to suggest that C4 was an inappropriate model for 
the correlations, these results indicate that substantial 
variation in the correlations is not explained by the item 
means. 

The sampling variances of the direct estimates were often 
less than the corresponding model error variances (lower 
part of Tables 5 and 6), especially for the 0-10 items. Under 
C4, model error variances were smaller for only 13% of 
domains for the 0-10 ratings, 45% of domains for the 1-4 
ratings, and approximately 81% of domains for the 1-3 and 
1-2 ratings. 

Figure 4 presents the observed correlations and fitted 
function C4 for an illustrative pair of items from each of the 
10 combinations of item types, representing the 595 distinct 
pairs of items. To illustrate the fitted correlation models, we 
adjust the observed and fitted correlations to the mean of 
one item and plot the resulting values in two-dimensional 
space. This process is repeated for the other item, yielding 
two plots for each correlation. 

Figure 4 illustrates the generally weak relationship of the 
correlation to the means of the items seen in Tables 5 and 6. 
Analysis of Tables 5 and 6 reveals that the relationship 
between the correlation and the mean outcome is weaker for 
items with fewer categories and with correlations of items of 
different types. In particular, the 0-10 numerical ratings are 
the only group for which there is a clear correlation-mean 
relationship. 


Table 5 
Model Fitting Diagnostics for Correlation Functions for Items of the Same Type, Averaged over Pairs of Items of the Same Type 

Rating Type               0-10              1-4               1-3               1-2 
Sampling Variation       0.0124            0.0178            0.1482            0.0325 
                     ModErr   R^2      ModErr   R^2      ModErr   R^2      ModErr   R^2 
Model C1             0.060   0.000     0.028   0.000     0.112   0.000     0.018   0.000 
Model C2             0.060   0.013     0.025   0.070     0.103   0.048     0.017   0.014 
Model C3             0.057   0.039     0.024   0.079     0.102   0.054     0.017   0.018 
Model C4             0.047   0.150     0.023   0.100     0.100   0.068     0.016   0.029 
Model C5             0.044   0.151     0.023   0.105     0.096   0.080     0.015   0.034 
Prob(ModErr < Sampling Variation) 
Model C1                 0.033             0.339             0.461             0.788 
Model C2                 0.033             0.400             0.498             0.795 
Model C3                 0.034             0.411             0.502             0.796 
Model C4                 0.038             0.435             0.516             0.799 
Model C5                 0.065             0.440             0.530             0.802 
See Table 1 for a description of the 0-10, 1-4, 1-3 and 1-2 items, and Table 3 for an explanation of the column headings. 
Table 6 
Model Fitting Diagnostics for Correlation Functions for C4 by Type of Item, 
Averaged over Items of the Same Type 

Types          0-10              1-4               1-3               1-2 
          ModErr   R^2      ModErr   R^2      ModErr   R^2      ModErr   R^2 
0-10      0.047   0.149     0.021   0.104     0.040   0.094     0.013   0.059 
1-4                         0.023   0.100     0.038   0.076     0.013   0.039 
1-3                                           0.100   0.068     0.028   0.031 
1-2                                                             0.016   0.029 

Prob(ModErr < Sampling Variation) 

0-10          0.038             0.358             0.523             0.784 
1-4                             0.435             0.605             0.790 
1-3                                               0.516             0.827 
1-2                                                                 0.799 


See Table 1 for a description of the 0-10, 1-4, 1-3 and 1-2 items, and Table 3 for an explanation 
of the column headings. 

Figure 4. Correlation Functions for One Pair of Items for Each Combination of Rating Types. 
Note: The plots for each item involved in the correlation are side by side. Refer to 
Figure 3 for a description of the contents and axes of the plot. 


Although the fitted curves for the correlation functions 
are nearly flat, the parameter estimates for $\alpha_{3ij}$ under 
model C4 varied widely, which is suggestive of 
instability. The wildly varying parameter estimates are a 
consequence of collinearity among the predictors in model 
C4. In many cases the estimated value of $\alpha_{3ij}$ offsets the 
parameter estimates for the linear predictors, resulting in a 
fitted curve that is nearly flat. 


4.4 Mean Difference Functions 


The difference $D_{h,ij}$ appeared to depend on both the 
marginal mean and its square, implying a model analogous 
to V3 could be appropriate. However, because $D_{h,ij}$ 
typically is small enough that the product $D_{h,ij} D_{h,ji}$ has minimal 
impact on (16), we fit a constant model. 


4.5 Composite Estimator 


Table 7 presents the quantiles of the distribution of 
weights $\sigma_h^2/(\tau^2 + \sigma_h^2)$ for the model-based estimate, used in 
the composite estimator of section 3.5, averaged over items 
(or pairs of items) of the same type. The proportion of 
domains for which the standard error of the model-based 
predictions was smaller than that of the direct estimates is 
also presented. As noted previously, the model-based predictions 
have more weight in the composite variance 
estimates than in the composite correlation estimates. The 
average (across items or pairs) median of the weights of the 
model-based estimator ranged from 0.892 to 1.000 for 
variances, 0.256 to 0.709 for correlations of items of the 
same type, and from 0.468 to 0.738 for correlations of items 
of different types. Also, for both variances and correlations, 
the weight of the model-based predictions was larger for 
items with fewer response categories. For example, the 
model-based estimator had median weights of 0.256, 0.468, 
0.540, and 0.647 on the composite estimates of correlations 
when the numerical 0-10 ratings were paired with the 0-10, 
1-4, 1-3, and 1-2 ratings, respectively. However, even for 
pairs of 0-10 numerical ratings, for which sampling error of 
the direct estimator exceeded the model error in only 3.81% 
of domains, these results indicate that the median weight of 
the model-based estimator was 0.256, a nontrivial amount. 

Table 7 
Distribution of Weights for the Model-Based Component of the Composite Estimator, Averaged Over 
Items of Same Type 


Model          Item Type       Prob(ModErr <                Quantiles 
               1       2       Sampling Variation)     10%     Median    90% 
Variance       0-10    -            0.981              0.778    0.892    0.948 
               1-4     -            0.983              0.948    0.966    0.974 
               1-3     -            0.996              1.000    1.000    1.000 
Correlation    0-10    0-10         0.038              0.141    0.256    0.335 
               0-10    1-4          0.358              0.301    0.468    0.562 
               0-10    1-3          0.523              0.357    0.540    0.654 
               0-10    1-2          0.784              0.531    0.695    0.767 
               1-4     1-4          0.435              0.324    0.497    0.591 
               1-4     1-3          0.605              0.404    0.587    0.699 
               1-4     1-2          0.853              0.584    0.738    0.805 
               1-3     1-3          0.516              0.349    0.540    0.675 
               1-3     1-2          0.827              0.584    0.737    0.817 
               1-2     1-2          0.799              0.541    0.709    0.780 


The distribution of weights is summarized by the 10th, 50th, and 90th percentiles. See Table 3 for definition of ModErr. 


4.6 Joint Predictions 


Because we modeled the correlations independently for 
each item, our fitted correlation matrices do not necessarily 
satisfy the constraint of positive definiteness, which can be 
important for multivariate inference. In additional work, we 
have determined that as long as the multivariate analysis is 
restricted to items of the same type, the fitted correlations 
from the C2 and C4 models yield positive definite estimates 
of correlation matrices for almost all domains. However, for 
analyses including items of different types (e.g., the 0-10 
numerical items, and the 1-2 yes/no items), predictions 
based on C4 predict correlation matrices that are indefinite 
for many domains, while predictions based on C2 are more 
stable and almost always yield positive definite predictions. 
This suggests that while C4 may be slightly superior in 
terms of univariate model fit, C2 may be more appropriate 
for multivariate inference. 

One way of overcoming the problem of indefinite 
predicted correlation matrices is to use a weighted average 
of the predicted correlation matrix for a domain and the 
estimated average correlation matrix (EACM) across 
domains. The EACM may be constructed by weighting the 
direct estimates (each of which is at least positive semi- 
definite) by the total sample size for each domain. Then any 
indefinite predicted correlation matrices are replaced with 
the weighted average of the predicted correlation matrix and 
the EACM, where the weight used for each domain is 
increased until a positive definite matrix results. Like an 
empirical Bayes estimator, this process stabilizes estimates 
by effectively shrinking the model coefficients toward those 
of a simpler (constant) model. 
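
The replacement step described above can be sketched as follows. This is only an illustrative sketch in plain Python, not the authors' implementation: the function names, the Cholesky-based test of positive definiteness, and the fixed step of 0.01 by which the EACM weight is increased are all assumptions made for the example.

```python
import math


def is_positive_definite(mat, tol=1e-12):
    """Return True if the symmetric matrix `mat` has a Cholesky factor."""
    n = len(mat)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = mat[i][i] - s
                if pivot <= tol:
                    return False  # a non-positive pivot means not PD
                chol[i][j] = math.sqrt(pivot)
            else:
                chol[i][j] = (mat[i][j] - s) / chol[j][j]
    return True


def shrink_to_eacm(predicted, eacm, step=0.01):
    """Raise the EACM weight (hypothetical step size) until the blend is PD."""
    n = len(predicted)
    n_steps = int(round(1.0 / step))
    for s in range(n_steps + 1):
        w = s * step
        blend = [[(1 - w) * predicted[i][j] + w * eacm[i][j]
                  for j in range(n)] for i in range(n)]
        if is_positive_definite(blend):
            return w, blend
    raise ValueError("the EACM itself is not positive definite")


# Made-up 3 x 3 example: an indefinite prediction and a well-behaved EACM.
predicted = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
eacm = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
weight, blended = shrink_to_eacm(predicted, eacm)
```

A prediction that is already positive definite is returned unchanged with weight 0, which mirrors the near-zero average EACM weights reported for model C2.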

When analyzing all 35 CAHPS items simultaneously the 
EACM had an average weight across domains of 0.65 with 
model C4, whereas with model C2 the average weight was 
only 0.01 since the predicted correlations under C2 were 
usually positive definite. In analyzing only the 0-10, 1-4, 
and 1-3 items the EACM had average weights of 0.28 and 
0.00 with C4 and C2 respectively, while in analyzing just 
the 0-10 and 1-4 items the corresponding average weights 
were 0.06 and 0.00. When analyzing the different types of 
items separately, the average weight of the EACM with C4 
was 0.00 for the 0-10 and 1-4 items, 0.01 for the 1-3 
items, and 0.17 for the 1-2 items. The EACM is thus not 
needed when analyzing the 0-10 and 1-4 items because the 
predicted correlation matrices were positive definite for 
every domain. 


5. Conclusion 


We have presented methodology for estimating variance 
and covariance functions for domain means of ordinal 
survey items. Our methodology can also be applied to 
survey items measured on continuous scales. We introduced 
a decomposition of the model error that allows the variation 
due to sampling to be separated from that due to model fit. 
The decomposition also helps to avoid over-fitting because 
it estimates the proportion of variation in the data that can be 
modeled and thus indicates when the current predictors suffice. 

The procedure for fitting the variance and correlation 
models is the same regardless of whether or not the data 
contain skip patterns. The analytic derivation in section 3.3 
shows that if skip patterns are present, mean differences of 
items by response status of other items are required in order 
to compute the sampling covariance estimates. However, 
we argued that these quantities are likely to have minimal 
impact on the results and that therefore a constant model 
could be used, which was supported by our empirical 
findings. 

A quadratic variance function constrained to 0 at the 
maximum rating, and a model for transformed correlations 
involving the product but not the squares of the means, best 
predicted the direct estimates in our applied example. The 
modeled variance estimates generally had much smaller 
standard errors than the direct estimates; the same was, 
however, not true of the correlation estimates. It is 
interesting and reassuring that our quadratic variance 
function can be expressed as the widely-used relative 
variance model of Wolter (1985). 

For our ordinal data, the estimates of the domain mean 
ratings contain minimal information about the correlation 
between the ratings. Hence, the mean-covariance relation- 
ship is principally an artifact of the mean-variance relation- 
ship. However, for items with many response categories, the 
association between correlations and mean outcomes for 
items of the same type was stronger, most notably for pairs 
of 0-10 items. With the exception of the 0-10 and possibly 
the 1-4 ratings, the correlations might as well be modeled 
as constants, which also makes it easier to guarantee 
positive definiteness of the predicted correlation matrix. 
However, it is important that the parameters of the 
correlation model be allowed to vary across pairs of items. 

A composite estimator that weights the direct and model- 
based estimators proportional to their precisions has smaller 
variance than either estimator alone, especially when the 
components have close to equal weight. The model-based 
estimator had the greatest influence on estimates for small 
domains, for which little information is available. The 
model-based estimator had the greatest influence on 
estimates for variances, followed by correlations of items of 
the same type, and lastly correlations of items of different 
types. Both model-based and composite estimators can be 
benchmarked (ratio adjusted) to agree on the average across 
domains with direct estimates, although this proved to be 
unnecessary in our example. 
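
As a small illustration of the precision weighting described in this paragraph, the following Python sketch combines one direct and one model-based estimate. It assumes the two estimators are unbiased with independent errors and known variances; the function name and the numbers are hypothetical and are not taken from the CAHPS analysis.

```python
def composite_estimate(direct, var_direct, model, var_model):
    """Precision-weighted combination of a direct and a model-based estimate.

    Assumes independent, unbiased components with known variances.
    """
    # Weight on the model-based estimate is proportional to its precision.
    weight_model = var_direct / (var_direct + var_model)
    estimate = weight_model * model + (1.0 - weight_model) * direct
    # Variance of the combination under the independence assumption.
    variance = var_direct * var_model / (var_direct + var_model)
    return estimate, weight_model, variance


# Equal precisions: each component receives weight 0.5 and the
# composite variance is half that of either component.
est, w_model, var_comp = composite_estimate(0.40, 0.04, 0.50, 0.04)
```

With equal precisions the variance reduction is largest, in line with the remark that the gain is greatest when the components have close to equal weight.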

GVCFs find several applications in our continuing 
research. We are developing quasi likelihood-based 
methods for estimating covariance matrices for the domain 
means of ordinal survey items, representing the second-level 
(structural) covariance in a hierarchical model (O’Malley 
and Zaslavsky 2004). GVCF models are needed to provide 
estimates of sampling variances and covariances and to 
modify those estimates as the means are re-estimated during 
the fitting procedure. If the sampling variability of the 
GVCF estimates is minimal because the number of domains 
is large, the GVCF predicted variances and covariances can 
be treated as known. However, if the sampling error of the 
GVCF-based estimates is large, a model that allows these 
errors to propagate through the analysis should be used. In 
related work, Fay and Train (1997) used a binomial model 
with a design effect for each domain in empirical Bayes 
estimation of binomial rates. Our research extends this 
approach to multivariate estimation and more general 
response formats. 

Another application of GVCFs is the computation of 
variance estimates for linear combinations of item means, 
facilitating variance estimation for composite scores, like 
those used in CAHPS reporting. The methods described in 
section 2 are applicable to variance estimation for any 
functions of totals, including functions of means, other 
ratios, or regression coefficients. 

There are several ways of extending the GVCF 
methodology. In addition to summary measures of 
outcomes, generalized variance and covariance functions 
(GVCFs) may also depend on other independent variables, 
in particular those that would better predict correlations. We 
considered variables summarizing response patterns, such as 
the proportion of respondents in a domain, but these did not 
improve the model. GVCFs could also be extended to multi- 
stage sampling. 
Spatio-Temporal Models in Small Area Estimation 


Bharat Bhushan Singh, Girja Kant Shukla and Debasis Kundu 


Abstract 


A spatial regression model in a general mixed effects model framework has been proposed for the small area estimation 
problem. A common autocorrelation parameter across the small areas has resulted in the improvement of the small area 
estimates. It has been found to be very useful in the cases where there is little improvement in the small area estimates due to 
the exogenous variables. A second order approximation to the mean squared error (MSE) of the empirical best linear 
unbiased predictor (EBLUP) has also been worked out. Using the Kalman filtering approach, a spatial temporal model has 
been proposed. In this case also, a second order approximation to the MSE of the EBLUP has been obtained. As a case 
study, the time series monthly per capita consumption expenditure (MPCE) data from the National Sample Survey 
Organisation (NSSO) of the Ministry of Statistics and Programme Implementation, Government of India, have been used 
for the validation of the models. 


Key Words: Mixed effects linear model; Spatial autocorrelation; Weight matrix; Best linear unbiased predictor; 
Empirical best linear unbiased predictor; Kalman filtering; NSSO rounds. 


1. Introduction 


Local level planning requires reliable data at the appropriate 
level. Complete enumeration, or a large sample survey with 
adequate sample size, is expensive and time consuming. 
Censuses are usually carried out once in a decade, while 
sample surveys are often planned to provide estimates at a 
much higher level. One such large sample survey is the 
socio-economic survey of the National Sample Survey 
Organisation (NSSO). Here the direct survey estimates are 
available at the small area (district) level, as most of the 
districts are strata in the sampling procedure adopted by the 
NSSO. However, the estimates are exceedingly unreliable 
due to unacceptably large standard errors. Such estimates 
therefore need to be strengthened using information from 
similar small areas or with the help of easily available 
exogenous variables related to the variable under study. 

Various model-based approaches have been suggested to 
improve the direct estimators. The model-based approach 
facilitates its validation through the sample data. A simple 
area specific model is the two stage model of Fay and 
Herriot (1979): 


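$$y_i = \theta_i + e_i, \quad E(e_i \mid \theta_i) = 0, \quad \mathrm{Var}(e_i \mid \theta_i) = \sigma_i^2; \qquad (1.1)$$

$$\theta_i = x_i^T \beta + v_i z_i, \quad E(v_i) = 0, \quad \mathrm{Var}(v_i) = \sigma_v^2, \quad i = 1, 2, \ldots, m. \qquad (1.2)$$


Here the $y_i$'s are direct survey estimators of the $\theta_i$'s, the 
characteristic under study. The $\theta_i$'s may be population small 
area means. The $x_i = (x_{i1}, \ldots, x_{ip})^T$'s are exogenous variables 
which are available and assumed to be closely related to the 
$\theta_i$'s, and the $z_i$'s are known positive constants. $\beta\,(p \times 1)$ is the 
vector of regression parameters. 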


The first equation (1.1) is the design model while the 
second (1.2) is the linking model. The $e_i$'s are sampling 
errors. The estimators $y_i$ are design unbiased and the 
sampling variances $\sigma_i^2$ are known. Further, the $e_i$'s and 
$v_i$'s are identically and independently distributed random 
variables. Normality of the random errors and random 
effects is often assumed. For this model, the best linear 
unbiased predictor (BLUP), along the lines of the best linear 
unbiased estimator (BLUE), has been suggested. The 
estimate is design consistent and model unbiased (Ghosh 
and Rao 1994). It is typically the weighted average of the 
direct survey estimator $y_i$ and the regression synthetic 
estimator $x_i^T \beta$. The BLUP estimator depends on the variance 
component $\sigma_v^2$, which is unknown in practical applications. 
Various methods of estimating variance components in the 
general mixed effects linear model are available (Cressie 
1992). By replacing $\sigma_v^2$ with an asymptotically consistent 
estimator $\hat\sigma_v^2$, an empirical best linear unbiased predictor 
(EBLUP) has also been obtained. 
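
The weighted-average form of the predictor can be illustrated for a single area. The Python sketch below assumes the regression coefficients and the variance component are known (so it shows the BLUP rather than the EBLUP) and uses the usual shrinkage weight for the Fay-Herriot model; the function name and the numerical values are hypothetical.

```python
def fay_herriot_blup(y, synthetic, sigma_v2, sampling_var, z=1.0):
    """Shrink a direct estimate towards a regression synthetic estimate.

    y            : direct survey estimate for the area
    synthetic    : regression synthetic estimate (x_i' beta, beta assumed known)
    sigma_v2     : variance of the area random effect
    sampling_var : known sampling variance of y
    z            : known positive constant multiplying the random effect
    """
    signal_var = sigma_v2 * z * z
    gamma = signal_var / (signal_var + sampling_var)  # weight on the direct estimate
    return gamma * y + (1.0 - gamma) * synthetic, gamma


# A noisy direct estimate (sampling variance 3 against a model variance of 1)
# is pulled three quarters of the way towards the synthetic estimate.
blup, gamma = fay_herriot_blup(y=10.0, synthetic=20.0, sigma_v2=1.0, sampling_var=3.0)
```

The larger the sampling variance relative to the model variance, the smaller the weight on the direct estimate.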

The main problem associated with the data in the Indian 
context is the non-availability of administrative or civic 
registration data at small area level. Often, it is difficult to 
find exogenous variables closely related (multiple 
correlation coefficient $R^2 > 0.5$) to the variable under 
study. 

In the present paper, the exploitation of spatial 
autocorrelation amongst the small area units, in the form of a 
spatial model, has been considered for improving the small 
area estimators. Besides this, for time series data, a 
spatial temporal model along the lines of Kalman filtering has 
been utilised to further improve the estimators. Time series 
data on monthly per capita consumption expenditure 
(MPCE), as estimated from a large sample survey carried out 
by the National Sample Survey Organisation (NSSO), have 
been studied. In the present paper, we propose suitable 
models in the framework of the mixed effects linear model to 
provide better estimators of the MPCE at small area level. 

The rest of the paper is organized as follows. In 
Section 2, we consider a Spatial Model on the line of 
general mixed effects linear model with the introduction of 
spatial autocorrelation among the small area units. The 
BLUP and EBLUP of the mixed effects have been 
presented. A second order approximation to the MSE of the 
EBLUP and to the estimator of the MSE has also been 
obtained. Section 3 deals with the time series extension of 
Spatial Model in form of Spatial Temporal Model, using the 
Kalman filtering approach. The BLUP and the EBLUP of 
the mixed effects along with a second order approximation 
to the MSE of the EBLUP and to the estimator of the MSE 
have been discussed. Section 4 presents and analyses 
estimates of the MPCE from a large sample survey carried 
out periodically in India. The conclusions of the data 
analysis are reported in Section 5. All the proofs have been 
provided in the Appendix. 


2. Spatial Model 


The small area characteristics usually have the spatial 
dependence in terms of neighbourhood similarities. Cressie 
(1990) used conditional spatial dependence among random 
effects, in the context of adjustment for census undercounts. 
Here, we use simultaneous spatial dependence (Cliff and 
Ord 1981) among the random effects which has certain 
advantage over conditional dependence (Ripley 1981). We 
have thus tried to explain a portion of the random error 
unaccounted for and left over by explanatory variables 
which makes it possible to improve the direct survey 
estimators. The proposed model is a three stage area specific 
model (Ghosh and Rao 1994). 


$$y = \theta + e, \quad e \sim N_m(0, R), \qquad (2.1)$$
$$\theta = X\beta + u, \qquad (2.2)$$
$$u = \rho W u + v, \quad v \sim N_m(0, \sigma_v^2 I), \qquad (2.3)$$


where $\theta$ is an m-component vector (corresponding to the 
number of small areas) for the characteristic under study and 
$y$ is its direct survey estimator obtained through small 
sample data. In the above model, the first equation (2.1) 
is the design (sampling) model, the second equation 
(2.2) is the regression model and the third (2.3) is the 
spatial model on the residuals; the latter two are linked in the 
first equation. The above model can be expressed as 


$$y = X\beta + Zv + e, \quad Z = (I - \rho W)^{-1}, \qquad (2.4)$$

where $X\,(m \times p)$ is the design matrix of full column rank p, 
$\beta\,(p \times 1)$ is a column vector of regression parameters and 
$Z\,(m \times m)$ represents the coefficients of the random effects $v$. 
$W\,(m \times m)$ is a known spatial weight matrix which shows 
the amount of interaction between any pair of small areas. 
The elements of $W = [W_{ij}]$, with $W_{ii} = 0 \;\forall i$, may depend 
on the distance between the centers of small areas or on the 
length of common boundary between them. As a simple 
alternative, it may have binary values $W_{ij} = 1$ (unscaled) if 
the $j$-th area is physically contiguous to the $i$-th area and $W_{ij} = 0$ 
otherwise. The matrix has been standardised so as to satisfy 
$\sum_{j=1}^{m} W_{ij} = 1$ for $i = 1, 2, \ldots, m$. The constant $\rho$ is a measure 
of the overall level of spatial autocorrelation and its 
magnitude reflects the suitability of $W$ for given $y$ and $X$. 
Further, $v$ and $e$ are assumed to be independent of each 
other. $R$ is a diagonal matrix of order m which may be 
expressed as $R = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)$, where the $\sigma_i^2$ are 
known sampling variances corresponding to the $i$-th area. 
The parameter vector $\psi = [\rho, \sigma_v^2]'$ has two elements. 
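
The binary, row-standardised form of the weight matrix can be sketched in a few lines of Python. The adjacency list, the function name, and the treatment of an area with no neighbours (its row is left at zero) are assumptions made for this illustration only.

```python
def row_standardised_weights(neighbours, m):
    """Build a binary contiguity matrix for m areas and scale each row to sum to 1.

    `neighbours` maps an area index to the indices of its contiguous areas.
    """
    weights = [[0.0] * m for _ in range(m)]
    for i in range(m):
        # Drop duplicates and any self-reference so that the diagonal stays zero.
        adjacent = sorted(set(neighbours.get(i, ())) - {i})
        for j in adjacent:
            weights[i][j] = 1.0 / len(adjacent)
    return weights


# Three hypothetical areas in a row: area 1 touches both 0 and 2.
chain = {0: [1], 1: [0, 2], 2: [1]}
W = row_standardised_weights(chain, 3)
```

Note that the standardised matrix is in general not symmetric even when the underlying contiguity relation is.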

In this model the strength is borrowed from similar 
small areas through two common parameters, viz. the regression 
parameter $\beta$ and the autocorrelation parameter $\rho$. Note that the 
present model is more general, and the model of Fay 
and Herriot (1979) can be obtained from it by taking 
$\rho = 0$. 

By adopting the mixed effects linear model approach 
(Henderson 1975), the best linear unbiased predictor 
(BLUP) of $\theta = X\beta + Zv$ and the mean squared error 
(MSE) of the BLUP may be obtained as 


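$$\tilde\theta(\psi) = X\tilde\beta(\psi) + \Lambda(\psi)[y - X\tilde\beta(\psi)]$$

$$= \sigma_v^2 A^{-1}(\psi)\Sigma^{-1}(\psi)\,y + R\,\Sigma^{-1}(\psi) X \tilde\beta(\psi), \qquad (2.5)$$

$$\mathrm{MSE}[\tilde\theta(\psi)] = E[(\tilde\theta(\psi) - \theta)(\tilde\theta(\psi) - \theta)^T] = g_1(\psi) + g_2(\psi), \qquad (2.6)$$

$$g_1(\psi) = \Lambda(\psi) R = R - R\,\Sigma^{-1}(\psi) R, \qquad (2.7)$$

$$g_2(\psi) = R\,\Sigma^{-1}(\psi) X (X^T \Sigma^{-1}(\psi) X)^{-1} X^T \Sigma^{-1}(\psi) R, \qquad (2.8)$$

where

$$\tilde\beta(\psi) = [X^T \Sigma^{-1}(\psi) X]^{-1} X^T \Sigma^{-1}(\psi)\,y,$$
$$\Sigma(\psi) = \sigma_v^2 A^{-1}(\psi) + R,$$
$$\Lambda(\psi) = \sigma_v^2 A^{-1}(\psi)\Sigma^{-1}(\psi), \quad A(\psi) = (I - \rho W)^T (I - \rho W).$$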


Here $\tilde\beta$, $\Sigma$ and $A$ are all functions of $\psi$ and usually 
have been expressed as $\tilde\beta(\psi)$, $\Sigma(\psi)$ and $A(\psi)$ 
respectively. However, sometimes for brevity, the argument $\psi$ has 
been omitted. The first term, $g_1(\psi)$, in the expression for 
the MSE shows the variability of $\tilde\theta$ when all the 
parameters are known and is of order $O(1)$. The second 
term, $g_2(\psi)$, due to estimating the fixed effects $\beta$, is of 
order $O(m^{-1})$ for large m. Further, with $\rho = 0$, the above 
model reduces to the standard mixed effects linear 
regression model, while for $X\beta = \mu$ we obtain a purely 
spatial scheme with only an intercept term. 
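
The quantities entering the BLUP follow directly from the model. As a short worked derivation, using only the model equations (2.1) to (2.4) and writing $A = (I - \rho W)^T (I - \rho W)$, $\Sigma$ for the covariance matrix of $y$ and $\Lambda = \sigma_v^2 A^{-1} \Sigma^{-1}$:

```latex
\begin{aligned}
u &= (I - \rho W)^{-1} v = Z v, \\
\mathrm{Var}(y) &= \sigma_v^2 Z Z^T + R
   = \sigma_v^2 \left[(I - \rho W)^T (I - \rho W)\right]^{-1} + R
   = \sigma_v^2 A^{-1} + R = \Sigma, \\
\Lambda &= \sigma_v^2 A^{-1} \Sigma^{-1}
   = (\Sigma - R)\,\Sigma^{-1} = I - R\,\Sigma^{-1}, \\
\Lambda R &= R - R\,\Sigma^{-1} R .
\end{aligned}
```

For $\rho = 0$ we have $A = I$, so $\Lambda$ is diagonal with elements $\sigma_v^2 / (\sigma_v^2 + \sigma_i^2)$, the familiar Fay-Herriot shrinkage weights.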

In practice the parameter $\psi$ is unknown and is estimated 
from the data. The maximum likelihood estimator (MLE) of 
$\psi$ is obtained by maximizing the following 
log likelihood function 

$$l = \text{const} - \tfrac{1}{2}\log|\Sigma(\psi)| - \tfrac{1}{2}[y - X\tilde\beta(\psi)]^T \Sigma^{-1}(\psi)[y - X\tilde\beta(\psi)] \qquad (2.9)$$

with respect to the parameter $\psi$. The empirical best linear 
unbiased predictor (EBLUP), $\tilde\theta(\hat\psi)$, and the naive estimator 
of the MSE are obtained from equations (2.5) and (2.6) 
respectively, by replacing the parameter vector $\psi$ by its 
estimator $\hat\psi$: 


$$\tilde\theta(\hat\psi) = \hat\sigma_v^2 A^{-1}(\hat\psi)\Sigma^{-1}(\hat\psi)\,y + R\,\Sigma^{-1}(\hat\psi) X \tilde\beta(\hat\psi), \qquad (2.10)$$
$$\mathrm{MSE}[\tilde\theta(\hat\psi)] = g_1(\hat\psi) + g_2(\hat\psi), \qquad (2.11)$$
where $\Sigma(\hat\psi) = \hat\sigma_v^2 A^{-1}(\hat\psi) + R$ 
and $A(\hat\psi) = (I - \hat\rho W)^T (I - \hat\rho W)$. 


This expression for the MSE of the EBLUP severely 
underestimates the true MSE, as the variability due to the 
estimation of the parameters from the data has been 
ignored. We obtain a second order approximation to 
$\mathrm{MSE}[\tilde\theta(\hat\psi)]$ when $\hat\psi$ is the maximum likelihood 
estimator (MLE) or the restricted maximum likelihood 
estimator (REMLE) of $\psi$, assuming large m 
and neglecting all terms of order $o(m^{-1})$, under 
the following regularity conditions. The approximation has 
been worked out along the lines of Prasad and Rao (1990) 
and Datta and Lahiri (2000), which are heuristic in nature. 


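Regularity Conditions 1 

(a) The elements of $X$ are uniformly bounded such that 
$X^T \Sigma^{-1}(\psi) X = [O(m)]_{p \times p}$, where $\Sigma(\psi) = [\sigma_v^2 A^{-1}(\psi) + R]$; 

(b) m is finite; 

(c) $\Lambda(\psi) X = [O(1)]_{m \times p}$, $\partial[\Lambda(\psi) X]/\partial\psi_d = [O(1)]_{m \times p}$, 
$\partial^2 \Lambda(\psi)/\partial\psi_d\,\partial\psi_e = [O(1)]_{m \times m}$ for $d, e = 1, 2$; 

(d) $\hat\psi$ is the estimator of $\psi$ which satisfies 
$\hat\psi - \psi = O_p(m^{-1/2})$, $\hat\psi(-y) = \hat\psi(y)$, $\hat\psi(y + Xh) = \hat\psi(y)$ $\forall\, h \in R^p$ 
and $\forall\, y$. 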


These regularity conditions are satisfied in this case. The 
special standardised form of the weight matrix $W$ satisfies 
condition (c) for $|\rho| < 1$, as it has only a finite number of 
nonzero elements and its row sums are equal to 1. It may be 
mentioned here that the matrix $\sigma_v^2 A^{-1} \Sigma^{-1}$ has a finite number 
of nonzero elements and the order of $W$, $(I - \rho W)$, 
$W(I - \rho W)$, $\Sigma$, $\Sigma^{-1}$ and any sum or product combination of 
these and their derivatives mentioned in condition (c) does not 
increase. The MLE and the REMLE, in addition, satisfy 
condition (d). A second order approximation to the MSE of 
the EBLUP has been shown in Theorem A.1 of the 
Appendix as 


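$$\mathrm{MSE}[\tilde\theta(\hat\psi)] = E[(\tilde\theta(\hat\psi) - \theta)(\tilde\theta(\hat\psi) - \theta)^T]$$
$$= g_1(\psi) + g_2(\psi) + g_3(\psi) + o(m^{-1}). \qquad (2.12)$$


Here the third term $g_3(\psi)$ comes from estimating the 
unknown parameter vector from the sample data and it is of 
the same order $O(m^{-1})$ as that of $g_2(\psi)$. Further, $g_3(\psi)$ 
may be expressed as 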


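$$g_3(\psi) = L^T(\psi)\,[I_\psi^{-1}(\psi) \otimes \Sigma(\psi)]\,L(\psi), \qquad (2.13)$$
where 
$$L(\psi) = \mathrm{Col}_{1 \le d \le 2}[L_d(\psi)] = [L_1^T(\psi), L_2^T(\psi)]^T, \quad L_d(\psi) = \frac{\partial \Lambda(\psi)}{\partial \psi_d}, \; d = 1, 2,$$
$$I_\psi(\psi) = E\left[-\frac{\partial^2 l}{\partial\psi\,\partial\psi^T}\right]$$

is the information matrix and $\otimes$ represents the Kronecker 
product. Further, $g_3(\psi)$ may also be written as 

$$g_3(\psi) = \sum_{d=1}^{2} \sum_{e=1}^{2} L_d^T(\psi)\,\Sigma(\psi)\,L_e(\psi)\,I^{de}(\psi), \qquad (2.14)$$

where $I_\psi^{-1}(\psi) = (I^{de}(\psi))$. 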


It is common practice to estimate the MSE of the EBLUP 
by replacing the unknown parameters, including components 
of the variance, by their respective estimators. This procedure 
can lead to severe underestimation of the true MSE 
(Prasad and Rao 1990; Singh, Stukel and Pfeffermann 
1998). We obtain the estimator of the MSE of the EBLUP 
in Theorem A.2 of the Appendix for large m, neglecting all 
terms of order $o(m^{-1})$. As a result we have the expressions 

$$E[g_1(\hat\psi) + g_3(\hat\psi) - g_4(\hat\psi) - g_5(\hat\psi)] = g_1(\psi) + o(m^{-1}), \qquad (2.15)$$

$$E[g_2(\hat\psi)] = g_2(\psi) + o(m^{-1}) \quad \text{and} \quad E[g_3(\hat\psi)] = g_3(\psi) + o(m^{-1}), \qquad (2.16)$$

and finally the estimator of the MSE of $\tilde\theta(\hat\psi)$ as 

$$\mathrm{mse}[\tilde\theta(\hat\psi)] = [g_1(\hat\psi) + g_2(\hat\psi) + 2 g_3(\hat\psi) - g_4(\hat\psi) - g_5(\hat\psi)] + o(m^{-1}), \qquad (2.17)$$

where $E[\mathrm{mse}(\tilde\theta(\hat\psi))] = \mathrm{MSE}[\tilde\theta(\hat\psi)] + o(m^{-1})$. 


The additional terms $g_3(\hat\psi)$, $g_4(\hat\psi)$ and 
$g_5(\hat\psi)$ are the contributions due to estimation of the unknown 
parameter vector $\psi$ by $\hat\psi$. The expressions for $g_4(\psi)$ and 
$g_5(\psi)$ up to order $o(m^{-1})$ are given by 
0 
2, (W) =[b? (W) @ I] ate 


ole mall (2.18) 
oy 


d 


b, (yw) =— ae a (W)C ae oT , (W) 


[7, @(R=U'(w))] 


1 
(W=5T (2.19) 
LENS debe Wee cone yo" (WR) | 


Here $b_\psi(\psi)$ is the bias of $\hat\psi$, i.e., $E(\hat\psi) - \psi$ up to order 
$o(m^{-1})$, and $\partial g_1(\psi)/\partial\psi$ is a partitioned matrix 
$[\partial g_1(\psi)/\partial\rho,\; \partial g_1(\psi)/\partial\sigma_v^2]'$ of order $(2m \times m)$ 
having 2 matrices of order $m \times m$ in a column. In the same 
way $\partial^2 \Sigma(\psi)/\partial\psi\,\partial\psi'$ is a partitioned matrix of order 
$(2m \times 2m)$ having 2 partitions, row and column wise, with 
$\partial^2 \Sigma(\psi)/\partial\psi_d\,\partial\psi_e$ being a general sub matrix of order 
$m \times m$ therein. $\mathrm{Trace}(B) = \sum_{d=1}^{2} B_{dd}$, where $B$ is a square 
partitioned matrix with square sub matrices of similar order. 
In addition $g_4(\psi)$ and $g_5(\psi)$ may also be written as 


g4(W)= 
: 4 a dg, (W) 
spay Fey tied (y) =e age ia (2.20) 
g,(W)= 
2 
sD Ele ay) aw em (W)RIZ | (2.21) 
d=1 e=l 


The expression (2.17) gives the matrix of the estimator of 
the MSE of the EBLUP, $\tilde\theta(\hat\psi)$, and the MSE of the individual 
small area estimators may be obtained as the respective 
diagonal elements. In the case of the simple model without 
spatial autocorrelation, similar expressions can be obtained. 
In this case $g_5(\psi)$, however, becomes zero. 
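
A brief sketch of why this term vanishes, assuming (as the description of the partitioned matrix of second derivatives above suggests) that $g_5$ depends on $\psi$ only through the second derivatives of $\Sigma(\psi)$: without spatial autocorrelation, $\rho = 0$ is fixed and the only unknown parameter is $\sigma_v^2$, so that

```latex
\Sigma(\psi) = \sigma_v^2 I + R, \qquad
\frac{\partial \Sigma(\psi)}{\partial \sigma_v^2} = I, \qquad
\frac{\partial^2 \Sigma(\psi)}{\partial (\sigma_v^2)^2} = 0 .
```

Since $\Sigma$ is linear in $\sigma_v^2$, every second derivative is a zero matrix and the corresponding term drops out.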


3. Spatial Temporal Model 


In this section, state space models via Kalman filtering 
have been used to take advantage of the time series data 
along with the common regression parameter and common 
autocorrelation parameter to strengthen the direct survey 
estimators at any point of time. This is especially 
advantageous in the case where the past survey estimates are 
more reliable. The models used in this category are the 
following: 

$$y_t = X_t\beta + Z v_t + e_t, \quad e_t \sim N_m(0, R_t), \quad Z = (I - \rho W)^{-1}, \qquad (3.1)$$

$$v_t = k\,v_{t-1} + \eta_t, \quad \eta_t \sim N_m(0, \sigma_v^2 I), \qquad (3.2)$$

where $e_t$ and $\eta_t$ are independent of each other. 


Here the parameters have the usual meaning as explained in 
the previous section. The weight matrix $W\,(m \times m)$ and design 
matrices $X_t\,(m \times p)$ are known, $Z\,(m \times m)$ is a matrix of 
coefficients of random effects and $\rho$ is an unknown 
autocorrelation coefficient. $R_t$ is a diagonal matrix of order 
m which may be expressed as $R_t = \mathrm{diag}(\sigma_{1t}^2, \sigma_{2t}^2, \ldots, \sigma_{mt}^2)$, 
where the $\sigma_{it}^2$ are known sampling variances corresponding 
to the $i$-th small area and $t$-th time point. $\beta$ is an unknown 
vector of fixed effects and $\psi = [\rho, \sigma_v^2, k]'$ is a vector of 
three unknown parameters. These parameters are 
independent of time $t$. It may be noted that the random 
effects $v_t$ have been allowed to change in accordance with 
(3.2) and $k$ is the temporal autoregressive parameter. For 
stationarity $|k| < 1$. 

The estimators of fixed and random effects and the MSE 
of these estimators are obtained in stages, starting with 
the mixed effects linear model approach at time 
$t = 1$, and by taking $v_1 \sim N_m(0, \sigma_v^2 I)$ (Sallas and Harville 
1994). In the standard form we write the model as 


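$$y_t = U_t \alpha_t + e_t, \quad \alpha_t = T\alpha_{t-1} + \zeta_t, \quad T = \mathrm{diag}[I_p, k I_m], \qquad (3.3)$$
$$\zeta_t \sim N_{p+m}(0, Q), \quad Q = \mathrm{diag}[0_p, \sigma_v^2 I_m], \quad U_t = [X_t, Z], \quad \alpha_t = [\beta^T, v_t^T]^T. \qquad (3.4)$$


Here $I_m$ and $0_m$ are the unit and zero matrices of order 
m, and by $\mathrm{diag}[I_p, A_m]$ we mean the block diagonal matrix 

$$\begin{bmatrix} I_p & 0 \\ 0 & A_m \end{bmatrix}.$$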

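In case $\beta$ is assumed fixed but dependent on time, there is 
no change in the model except that $T = \mathrm{diag}[0_p, k I_m]$. 
The initial estimates of the effects $\alpha_1$ and their variances 
(based on $t = 1$) are obtained as 

$$\hat\beta_1 = (X_1^T H_1^{-1} X_1)^{-1} X_1^T H_1^{-1} y_1, \quad \hat v_1 = \sigma_v^2 Z^T H_1^{-1}(y_1 - X_1\hat\beta_1),$$

$$\Sigma_{1|1} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \quad H_1 = \sigma_v^2 Z Z^T + R_1,$$

$$\Sigma_{11}\,(p \times p) = (X_1^T H_1^{-1} X_1)^{-1},$$

$$\Sigma_{12} = \Sigma_{21}^T = -\sigma_v^2 (X_1^T H_1^{-1} X_1)^{-1} X_1^T H_1^{-1} Z,$$

and 
$$\Sigma_{22}\,(m \times m) = \sigma_v^2 I_m - \sigma_v^4 Z^T H_1^{-1} Z + \sigma_v^4 Z^T H_1^{-1} X_1 (X_1^T H_1^{-1} X_1)^{-1} X_1^T H_1^{-1} Z.$$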

The recurring Kalman filtering equations for updating 
the estimators at subsequent stages are 

$$\hat\alpha_{t|t-1} = T\hat\alpha_{t-1|t-1}, \quad \Sigma_{t|t-1} = T\Sigma_{t-1|t-1}T^T + Q, \quad H_t = U_t \Sigma_{t|t-1} U_t^T + R_t,$$
$$\hat\alpha_{t|t} = \hat\alpha_{t|t-1} + \Sigma_{t|t-1} U_t^T H_t^{-1} (y_t - U_t \hat\alpha_{t|t-1}),$$
$$\Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1} U_t^T H_t^{-1} U_t \Sigma_{t|t-1},$$

where $\hat\alpha_{t|t-1}$ are the estimators of the effects $\alpha_t$ given the 
observations $[y_1, y_2, \ldots, y_{t-1}]$ and the $\Sigma_{t|t-1}$ are the mean 
squared errors of $\hat\alpha_{t|t-1}$. $H_t$ is the conditional variance 
covariance matrix of $y_t$ given $[y_1, y_2, \ldots, y_{t-1}]$. With the 
help of the above recurring filtering equations, the best 
linear unbiased predictor (BLUP) of $\theta_t = X_t\beta + Z v_t$ and 
the mean squared error (MSE) of the BLUP may be 
obtained as 


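$$\tilde\theta_t(\psi) = U_t(\psi)\,\hat\alpha_{t|t}(\psi)$$
$$= y_t - R_t H_t^{-1}(\psi)[y_t - U_t(\psi)\,\hat\alpha_{t|t-1}(\psi)]$$
$$= U_t(\psi)\,\hat\alpha_{t|t-1}(\psi) + \Lambda_t(\psi)\,\varepsilon_t(\psi), \qquad (3.5)$$

$$\mathrm{MSE}[\tilde\theta_t(\psi)] = g_{12t}(\psi) = U_t(\psi)\,\Sigma_{t|t}(\psi)\,U_t^T(\psi), \qquad (3.6)$$

where $\Lambda_t(\psi) = U_t(\psi)\,\Sigma_{t|t-1}(\psi)\,U_t^T(\psi)\,H_t^{-1}(\psi) = I_m - R_t H_t^{-1}(\psi)$ 
and $\varepsilon_t(\psi) = y_t - U_t(\psi)\,\hat\alpha_{t|t-1}(\psi)$. 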
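
To make the filtering step concrete, the following Python sketch runs one predict/update cycle of the textbook Kalman filter for a scalar state and a scalar observation. It is a simplified illustration under that scalar assumption, not the matrix implementation used for the model; the function name and the numbers are hypothetical, and the argument names only mirror the roles of T, Q, U and R in the state space form.

```python
def kalman_step(alpha_prev, sigma_prev, y, T, Q, U, R):
    """One predict/update cycle of a scalar Kalman filter."""
    # Prediction from the previous filtered state and its variance.
    alpha_pred = T * alpha_prev
    sigma_pred = T * sigma_prev * T + Q
    # Innovation and its variance (the role played by H_t).
    innovation = y - U * alpha_pred
    H = U * sigma_pred * U + R
    # Update with the Kalman gain.
    gain = sigma_pred * U / H
    alpha_filt = alpha_pred + gain * innovation
    sigma_filt = sigma_pred - gain * U * sigma_pred
    return alpha_filt, sigma_filt, innovation, H


# With equal prior and sampling variances the filtered state is midway
# between the prediction (0) and the observation (2).
y_obs, U_obs, R_obs = 2.0, 1.0, 1.0
alpha, sigma, innov, H = kalman_step(0.0, 1.0, y_obs, T=1.0, Q=0.0, U=U_obs, R=R_obs)
```

In this scalar case the filtered prediction `U_obs * alpha` equals `y_obs - R_obs / H * innov`, the scalar analogue of the second line of (3.5).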


It may be noted that $g_{12t}(\psi)$ is the spatial counterpart of 
$g_1(\psi) + g_2(\psi)$. As usual in practice, the parameter vector 
$\psi$ is unknown and its restricted maximum likelihood 
estimator (REMLE) can be obtained by maximizing the 
following log likelihood function, based on the sample data 
covering all time points 

1 = const. + logll X, ON 7 fa 1-4 lost #, I 


t=] 


=F ~X,B,)’ H;'(, -X,B,) 


ig : Z A 
atid (y, rls Qoggs dE '(y, —U,O4, 4) (3.7) 
t=2 
with respect to the parameter $\psi$. With the help of the above, 
the estimator $\hat\psi$ is obtained, and the EBLUP of $\theta_t$ and the 
naive estimator of the MSE of the EBLUP are given by 

$$\tilde\theta_t(\hat\psi) = U_t(\hat\psi)\,\hat\alpha_{t|t}(\hat\psi) = U_t(\hat\psi)\,\hat\alpha_{t|t-1}(\hat\psi) + \Lambda_t(\hat\psi)\,\varepsilon_t(\hat\psi), \qquad (3.8)$$

$$\mathrm{MSE}[\tilde\theta_t(\hat\psi)] = g_{12t}(\hat\psi) = U_t(\hat\psi)\,\Sigma_{t|t}(\hat\psi)\,U_t^T(\hat\psi). \qquad (3.9)$$

As explained earlier in Section 2, this naive estimator of the MSE 
of the EBLUP underestimates the true MSE, as it does not account 
for the variability due to replacing parameters by their estimates. A 
second order approximation to $\mathrm{MSE}[\tilde\theta_t(\hat\psi)]$ for large 
m, neglecting all terms of order $o(m^{-1})$, has been 
obtained in Theorem A.3 of the Appendix, under the 
following regularity conditions satisfied by our model. 
These conditions are analogous to Regularity Conditions 1. 


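Regularity Conditions 2 

(a) The elements of $X_t$, $t = 1, 2, \ldots, T$, are uniformly 
bounded such that $X_t^T \Sigma_t^{-1}(\psi) X_t = [O(m)]_{p \times p}$, where 
$\Sigma_t(\psi) = [\sigma_v^2 A^{-1}(\psi) + R_t]$; 

(b) m and T are finite; 

(c) $\Lambda_t(\psi) U_t(\psi) = [O(1)]_{m \times p}$, $\partial[\Lambda_t(\psi) U_t(\psi)]/\partial\psi_d = [O(1)]_{m \times p}$, 
$\partial^2 \Lambda_t(\psi)/\partial\psi_d\,\partial\psi_e = [O(1)]_{m \times m}$, $t = 1, 2, \ldots, T$ 
and $d, e = 1, 2, 3$; 

(d) $\hat\psi$ is the estimator of $\psi$ which satisfies 
$\hat\psi - \psi = O_p(m^{-1/2})$, $\hat\psi(-y) = \hat\psi(y)$, $\hat\psi(y + Xh) = \hat\psi(y)$ $\forall\, h \in R^p$ 
and $\forall\, y$. 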


The second order approximation to the MSE of the 
EBLUP is 

$$\mathrm{MSE}[\tilde\theta_t(\hat\psi)] = E[(\tilde\theta_t(\hat\psi) - \theta_t)(\tilde\theta_t(\hat\psi) - \theta_t)^T]$$
$$= g_{12t}(\psi) + g_{3t}(\psi) + o(m^{-1}). \qquad (3.10)$$


Here $g_{3t}(\psi)$ is the bias due to the estimation of the 
parameters from the sample data and is of the order $O(m^{-1})$ 
and it is given by 


SW=L Wy WK WAL, WL,(y) G11) 


where K,(w) =(K,,(W)) 


and K,.(y)= $Y thse c jak | (3.12) 


i=] 


Further 

L(y) =Col[L,(y)] and L(y) = (0A, (W) Oy,) 
l<d<3 

fotade= 1,23: 


In a proper form, we may write g,,(W) as 


83,(W) = 

By) Wes) 5 

la WO 

f=) 2=1 
2. 3 wid bess OH: shixg 

La(w)| x> Trace] H;'— H; L(y). 
22, a i=l | OW oe 
x HI. (W) 


The expression for the information matrix involved here, 
may be given as 
ag | 
OY OVW, 


1 
85 rn HE Bo ML 
2 Wa We | tt | OWa W. 


t=1 
(Xf HX)" XA. 


bs (y) as e- 


1 
~5 Trace | ee oe, ate, 
Ow,0Y, OW, OY, 
6X eigen XG wet Ha Xs 
l OW 
—— Trace inh i 
(XT AOR EX Hee x, 


e 


An estimator of the MSE of the EBLUP has also been obtained, 
assuming large m and neglecting all terms of 
order $o(m^{-1})$, in Theorem A.4 of the Appendix as 


mse[9, ()] =[8 12,0) + 83,0H) + 854. 0H) 


— 84,0) — 85,0] + 0(m™), (3.13) 
where g3,,(W), 84,(y) and g,,(y) are given as 

SW = Li WU, WS H,WWILy), 6.14) 

24, (W) =[02 (Y)® Ly] eee Be, 
b, => hw); Col 1| TF) Fs (W) i (3.15) 

OW 4 
85,0W) = 
1, ®(R,H,")] 

ra oh Ug @(H;R)f G16) 


4. Analysis of the NSSO Data 


The National Sample Survey Organisation (NSSO) of the 
Ministry of Statistics and Programme Implementation 
(Government of India) conducts quinquennial large sample 
surveys (QS) on household consumption expenditure and 
employment, roughly every five years in India. The surveys 
cover more than one hundred thousand households spread over a 
number of villages and urban blocks. In order to fill the gaps 
in data between successive QSs, the NSSO conducts an 
annual consumer expenditure survey (CES) in almost every 
round (equivalent to six months or one year duration). The 
annual series covers only 10-30 thousand households 
depending on the number of villages and urban blocks 
surveyed all over the country. Each round of the NSS normally 
has more than one subject of enquiry. The annual series has 
a different principal subject of enquiry. However, schedule 
1.0 of the annual surveys is designed to collect data on 
household consumption expenditure, among other 
characteristics on employment. 

The NSSO adopts a two-stage stratified sampling design.
The first-stage units are census villages in the rural sector,
selected through circular systematic sampling with
probability proportional to size (PPS), and the ultimate-stage
units are households, selected circular systematically with
independent random starts. India is divided into States, and
the Districts are the second-level administrative units within
the States. There is not much difference between the annual
and quinquennial surveys, except that the annual series
normally surveys a small sample of four households per
first-stage unit, while the quinquennial survey covers ten to
twelve households per first-stage unit. Besides this, NSSO
surveys have two samples: the central sample, surveyed by
the investigators of the NSSO, and the state sample,
surveyed by the State authorities. Regarding the estimation
procedure, the first-stage units are selected in the form of
two independent sub-samples. The estimates of the
population mean and its variance based on the two
sub-samples are obtained separately. The pooled mean
$y_i = (\bar{y}_{1i} + \bar{y}_{2i})/2$ and
$R_i = (\bar{y}_{1i} - \bar{y}_{2i})^2/4$ for $i = 1, 2, \ldots, m$,
where $\bar{y}_{1i}$, $\bar{y}_{2i}$ are the sub-sample means,
estimate respectively the population mean and its variance
for a particular district (small area). In the case of round 55,
the first-stage units are selected in the form of eight
independent sub-samples, and the estimates of the
population mean and its variance are based on these
sub-samples. In view of the problems related to the
estimates of the $R_i$'s with 1 d.f., the $R_i$ for each small
area were analysed and compared over time. Any abnormal
$R_i$ was smoothed by taking the average of the $R_i$'s over
neighboring time points and, in some cases, over
neighboring small areas also. The survey estimates $y_i$'s are
the direct estimates, and the smoothed $R_i$'s are the diagonal
elements of the sampling variance covariance matrix $R$ in
our model equations (2.1), (2.4) and (3.1), referred to in this
paper.
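
The two-sub-sample estimator described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually coded for the study; the function name `pooled_estimates` and the numbers in the example are assumptions made for demonstration only.

```python
import numpy as np


def pooled_estimates(y1, y2):
    """Pool two independent sub-sample means for each small area.

    y1, y2: sub-sample means, one entry per district.
    Returns the pooled mean (y1 + y2) / 2 and the variance
    estimate (y1 - y2)^2 / 4, which has 1 degree of freedom.
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    pooled_mean = (y1 + y2) / 2.0
    variance = (y1 - y2) ** 2 / 4.0
    return pooled_mean, variance


# Illustrative (made-up) sub-sample means for two districts.
y, R = pooled_estimates([270.0, 310.0], [280.0, 330.0])
```

Because each variance estimate rests on a single squared difference, it can be very unstable, which is why the text smooths abnormal values over neighbouring rounds or districts.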

In this paper, we have used data from the central sample
only. The estimates of monthly per capita consumption
expenditure (MPCE) and of the standard errors (SE) of the
estimators have been obtained under various mixed effects
models for the rural sector of the 63 districts (small areas)
of a large state in India, namely Uttar Pradesh. We have used
data from six rounds of the NSSO, viz. round 50 (July
1993-June 1994), round 51 (July 1994-June 1995), round 52
(July 1995-June 1996), round 53 (January-December 1997),
round 54 (January-June 1998) and round 55 (July 1999-June
2000). Of these, rounds 50 and 55 are based on
quinquennial surveys. The selected exogenous variables
used in the models are i) number of households, ii) gross
area sown and iii) per capita net area sown in the districts.
The agricultural data are available on an annual basis, while
the estimates of the households and the population were
obtained through interpolation based on the 1971, 1981 and
1991 decennial census data. These exogenous variables
have been selected, through covariate analysis, from a host
of variables ranging from the 1991 census to annual
agricultural data. Different weight matrices, such as the
length of the common boundary between a pair of districts,
the distance between the centres of two districts and binary
weights, were considered. Binary weights give a larger
estimate of the spatial autocorrelation coefficient; therefore
they (standardised so that each row of the weight matrix
sums to one) have been used for further analysis in this
paper. Throughout, the maximisation of the log likelihood
function and the estimation of the parameters have been
carried out using the Nelder and Mead simplex method in
MATLAB.
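
As an illustration of the row standardisation mentioned above, the following Python sketch builds a binary contiguity matrix for four hypothetical districts arranged in a line and rescales each row to sum to one. The adjacency pattern and the function name `row_standardise` are assumptions for illustration; they are not taken from the Uttar Pradesh data.

```python
import numpy as np


def row_standardise(adjacency):
    """Rescale a binary neighbourhood matrix so that every row sums to one."""
    W = np.asarray(adjacency, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        # A district with no neighbours cannot be row-standardised.
        raise ValueError("every district needs at least one neighbour")
    return W / row_sums


# Hypothetical layout: districts 0-1-2-3 in a line, neighbours share a border.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
W = row_standardise(adjacency)
```

With this scaling, a district's spatial lag is simply the average of its neighbours' values.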

Various mixed effects models used to obtain improved
estimates of MPCE are given in Table 1. The parameters in
the models have their usual meaning, as shown in Sections 2
and 3. Further, for each model, the sampling variance $R$, or
$R_t$ in the case of the temporal models, is assumed to be
known.


Table 1
Mixed Effects Models

Model 1    Direct Estimates
Model 2    Regression Model             $y = X\beta + v + e$
Model 3    Spatial Model                $y = X\beta + Zv + e$
Model 3A   Spatial Model (intercept)    $y = \mu + Zv + e$
Model 4    Regression Temporal          $y_t = X_t\beta + v_t + \epsilon_t$, $v_t = k v_{t-1} + \eta_t$
Model 5    Spatial Temporal             $y_t = X_t\beta + Zv_t + \epsilon_t$, $v_t = k v_{t-1} + \eta_t$

Table 2 presents the round-wise estimates of the parameters
for the simple mixed effects regression and spatial models.
The value of the multiple correlation coefficient $R^2$
between the MPCE estimates and the auxiliary variables is
also shown for each round. The figures in brackets are the
standard errors (SE) of the parameter estimates. Note that
$\lambda$ ($= \lambda_1, \lambda_2$) is the likelihood ratio
test (LRT) statistic defined as $-2 \log L \sim \chi^2_k$, where
$L$ is the ratio of nested likelihoods at the hypothesised
parameter values for two competing models under different
hypotheses and $k$ is the difference between the number of
parameters under the two models. Here $\lambda_1$ compares
the regression model and the spatial model, under
$H_0: \rho = 0$ against $H_1: \rho \neq 0$, and is distributed as
$\chi^2_1$ under $H_0$, and $\lambda_2$ compares the spatial
model and the spatial (intercept) model, under
$H_0: \beta = 0$ against $H_1: \beta \neq 0$ [$\beta$ does not
include the intercept term $\beta_0$], and is distributed as
$\chi^2_3$ under $H_0$.

On comparison of the simple regression model (Model 2)
and the spatial model (Model 3) through the LRT, we find
that, under $H_0$ ($\rho = 0$), the spatial autocorrelation
$\rho$ for Model 3 is highly significant for the two rounds 52
and 55. For these rounds, the use of the spatial model results
in much improvement in the estimates of MPCE. On the
other hand, in the case of rounds 50 and 53, and for these
only, the regression coefficients $\beta$ have been found
nearly significant for Model 3 in comparison to Model 3A,
which shows that the spatial model with intercept term may
improve the estimates for these rounds without any help
from the exogenous variables.
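
The comparison above can be retraced with a short Python sketch that checks the LRT statistics reported in Table 2 against the 5% critical values quoted in its footnote (3.841 for 1 d.f. and 7.815 for 3 d.f.). The dictionary layout and the helper name `significant_rounds` are illustrative assumptions.

```python
# 5% critical values of the chi-square distribution, as quoted with Table 2.
CHI2_CRIT_1DF = 3.841  # for lambda_1 (Model 2 versus Model 3)
CHI2_CRIT_3DF = 7.815  # for lambda_2 (Model 3A versus Model 3)

# LRT statistics by NSSO round, copied from Table 2.
lambda_1 = {50: 1.80, 51: 0.66, 52: 13.46, 53: 1.56, 54: 1.30, 55: 20.30}
lambda_2 = {50: 6.64, 51: 4.54, 52: 0.90, 53: 7.66, 54: 3.00, 55: 1.56}


def significant_rounds(stats, critical_value):
    """Return the rounds whose LRT statistic exceeds the critical value."""
    return sorted(r for r, value in stats.items() if value > critical_value)


spatial_significant = significant_rounds(lambda_1, CHI2_CRIT_1DF)
regression_significant = significant_rounds(lambda_2, CHI2_CRIT_3DF)
```

Under these values only rounds 52 and 55 reject $\rho = 0$, and no round formally rejects $\beta = 0$, although rounds 50 and 53 come close to the 7.815 threshold.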

Table 3 presents the parameter estimates and their SE in 
case of regression temporal model and spatial temporal 
model. 

Table 2
Estimates of Parameters for Small Area Estimates of MPCE Under Regression and Spatial Models

Round    $R^2$   Model 2       Model 3                  LRT           Model 3A                 LRT
                 $\sigma^2$    $\rho$    $\sigma^2$     $\lambda_1$   $\rho$    $\sigma^2$     $\lambda_2$

Rd. 50   0.27    1,724.48      0.30      1,635.70       1.80          0.59      1,724.68       6.64
                 (356.19)      (0.18)    (346.45)                     (0.13)    (378.66)

Rd. 51   0.27    3,424.21      0.48      3,156.90       0.66          0.67      3,077.32       4.54
                 (820.89)      (0.19)    (815.24)                     (0.13)    (824.54)

Rd. 52   0.17    2,150.54      0.87      714.96         13.46         0.86      768.11         0.90
                 (540.23)      (0.07)    (257.15)                     (0.07)    (272.27)

Rd. 53   0.13    6,312.99      -0.39     3,022.99       1.56          0.09      7,141.60       7.66
                 (1,397.92)    (0.27)    (1,374.70)                   (0.23)    (1,561.72)

Rd. 54   0.22    3,437.67      0.61      2,793.24       1.30          0.66      2,888.66       3.00
                 (806.87)      (0.14)    (742.35)                     (0.13)    (768.84)

Rd. 55   0.31    2,989.13      0.81      1,060.21       20.30         0.86      1,186.58       1.56
                 (712.28)      (0.06)    (362.40)                     (0.07)    (394.27)

$\lambda_1$ and $\lambda_2$ compare Models 2, 3 and Models 3, 3A respectively. $\chi^2_{1, 0.05} = 3.841$ for $\lambda_1$ and $\chi^2_{3, 0.05} = 7.815$ for $\lambda_2$.

For Model 4, the unconstrained iterative maximisation
process converged to a value of $k$ greater than 1, which is
inadmissible under the assumption of stationarity. For this
case, estimates were obtained by taking $k = 1$ and Model 4
was modified accordingly. Table 3 reports the results for
$k = 1$ in the case of the regression temporal model. The
spatial temporal model shows a higher value of the common
autocorrelation coefficient and a far lower value of the
estimate of $\sigma^2$. A summary of the round-wise average
estimates of MPCE (based on all the 63 districts), their
estimated standard errors (SE) and the coefficient of
variation (CV) under each model is presented in Table 4.

The results of Table 4 have been summarized below. 

The direct survey estimates are less precise, and all the
models involving mixed effects improve on them. The
estimates for rounds 50 and 55 (based on large samples) are
more precise than the estimates based on the other rounds.
The spatial model, depending on the value of $\rho$,
improves the estimates considerably. In the case of rounds
52 and 55, where the autocorrelation has been found
significant, the reduction in the average SE of the estimates
in comparison to the model without spatial autocorrelation
is considerable. Model 3A, with spatial effect and without
auxiliary variables, is equally good. The spatial temporal
model further improves the estimates by taking advantage
of the state space considerations. It may be noted that for
round 52 (very high spatial autocorrelation), the estimates
based on temporal models are worse than the estimates
based on models without temporal considerations. Perhaps
due to the fixed regression and autocorrelation parameters,
the estimates tend towards the average of the five rounds.

Table 3
Estimates of Parameters for Small Area Estimation of MPCE Under Regression Temporal and
Spatial Temporal Models

           $\rho$               $\sigma^2$              $k$
Models     Estimate   S.E.      Estimate    S.E.        Estimate   S.E.
Model 4    -          -         4,715.64    431.00      -          -
Model 5    0.79       0.04      2,163.50    245.50      0.53       0.07

Table 4
Average EBLUP for MPCE (Rs.), their Estimated SE and CV Under Regression,
Spatial, Regression Temporal and Spatial Temporal Models

                                  NSSO Rounds
Models     50            51            52            53       54       55
Average Small Area Estimates
Model 1    [illegible]   321.26        373.07        408.52   411.25   482.00
Model 2    272.87        312.53        354.45        397.52   400.87   471.99
Model 3    272.98        313.14        351.51        398.21   400.78   471.09
Model 3A   [illegible]   314.19        352.01        396.40   399.91   471.91
Model 4    274.13        305.62        345.54        383.53   399.56   463.32
Model 5    [illegible]   [illegible]   351.79        391.61   399.50   473.57
Average Standard Errors (SE)
Model 1    25.09         66.06         64.18         74.19    53.87    45.45
Model 2    17.10         33.65         29.09         39.85    32.68    30.59
Model 3    16.88         32.84         [illegible]   39.98    30.87    24.84
Model 3A   16.56         31.29         20.79         40.03    30.23    24.37
Model 4    19.51         34.91         35.19         37.79    35.14    [illegible]
Model 5    17.18         28.99         28.33         30.02    28.76    28.10
Average Coefficient of Variation (CV) (%)
Model 1    9.09          20.56         17.20         18.16    13.10    9.43
Model 2    6.27          10.79         8.21          10.01    8.15     6.48
Model 3    6.18          10.49         6.12          10.04    7.70     5.27
Model 3A   6.05          9.96          5.91          10.10    7.56     [illegible]
Model 4    [illegible]   11.42         10.18         9.85     8.79     7.15
Model 5    6.28          9.29          8.05          7.67     7.20     5.93

Table 5
Percentage Relative Efficiency [RMSE] of the Temporal Models in Comparison
to other Models for MPCE

                        NSSO Rounds
           50       51       52       53       54       55
Spatial Temporal Model [Model 5]
Model 2    123.63   170.54   193.68   203.55   204.72   169.76
Model 3    100.24   133.82   149.70   165.46   165.85   154.23
Model 4    125.81   141.50   141.93   137.55   139.11   129.88
Regression Temporal Model [Model 4]
Model 2    100.71   134.50   156.35   165.30   163.13   152.56

In order to judge the performance of the estimators under
the various models vis-a-vis the most general model (the
spatial temporal model), data have been simulated under the
spatial temporal model, and the true MSEs of the replicated
estimates under each of the assumed models have been
obtained. For this, we conducted the simulation by taking
the estimated parameters from the spatial temporal model,
given in Table 3, and obtained the true replicated small area
mean $\theta(b)$ for the $b$th replication
($b = 1, 2, \ldots, B$) along with the simulated observations
$y(b)$ for a large number of replications. On this simulated
dataset, for each replication, the different models, including
the spatial temporal model, have been applied and the small
area mean estimators under each of them obtained. While
fitting the regression and spatial temporal models on the
simulated datasets, the iterative maximisation process was
constrained to $k \le 1$. Here we have taken $B = 5{,}000$
replications. The true MSE of the estimator for the $i$th
small area under a particular model $k$ ($k = 2, 3, 4$) may be
defined as

$$\mathrm{MSE}(\hat\theta_i^k) = \frac{1}{B} \sum_{b=1}^{B} [\hat\theta_i^k(b) - \theta_i(b)]^2, \quad i = 1, 2, \ldots, m.$$


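The relative efficiency of the estimators under the spatial
temporal model (Model 5) against the estimators under
Models 2, 3 and 4 has been judged by the ratio of their mean
squared errors (RMSE) as

$$\mathrm{RMSE} = \frac{\sum_{i=1}^{m} \mathrm{MSE}(\hat\theta_i^k)}{\sum_{i=1}^{m} \mathrm{MSE}(\hat\theta_i^{\mathrm{Temp}})} \times 100$$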


where 'Temp' denotes the spatial temporal model and $k$
denotes Models 2, 3 and 4. Likewise, the relative efficiency
of the regression temporal model (Model 4) against the
simple regression model (Model 2) has been found by
simulating data under the regression temporal model, with
the estimated parameters given in Table 3. The results are
shown in Table 5.
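
The simulation summary described above can be sketched in Python as follows. This is a hedged illustration, not the code used for Table 5: the function names are hypothetical, the toy arrays stand in for the $B \times m$ matrices of replicated estimates and true small area means, and aggregating over areas by the ratio of summed MSEs is an assumption about how the per-area values are combined.

```python
import numpy as np


def empirical_mse(estimates, truths):
    """Per-area MSE over replications.

    estimates, truths: arrays of shape (B, m), one row per replication
    and one column per small area.
    """
    estimates = np.asarray(estimates, dtype=float)
    truths = np.asarray(truths, dtype=float)
    return np.mean((estimates - truths) ** 2, axis=0)


def relative_efficiency(mse_model, mse_reference):
    """Percentage ratio of summed MSEs (assumed aggregation over areas)."""
    return 100.0 * np.sum(mse_model) / np.sum(mse_reference)


# Toy data: B = 2 replications, m = 2 areas, true means set to zero.
truths = np.zeros((2, 2))
est_model = np.array([[1.0, 2.0], [1.0, 2.0]])      # competing model
est_reference = np.array([[0.5, 1.0], [0.5, 1.0]])  # reference model
mse_model = empirical_mse(est_model, truths)
mse_reference = empirical_mse(est_reference, truths)
efficiency = relative_efficiency(mse_model, mse_reference)
```

Values above 100 indicate that the reference model (the temporal model in Table 5) has the smaller mean squared error.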

The results confirm the superiority of the spatial temporal 
model in comparison to other models for these parameters. 
The regression temporal model has also been found better 
than the simple regression model. 


5. Conclusions 


The direct survey estimates based on small samples can be
considerably improved by using area-specific small area
models. The spatial autocorrelation amongst neighboring
areas may be exploited for improving the direct survey
estimates. However, the model must be applied only after
checking that the correlation amongst the small areas
arising from their neighborhood effects is significant. In the
case of a poor relation between the dependent and
exogenous variables, the simple spatial model with intercept
only may improve the estimates equally well. This model
uses only the spatial autocorrelation to strengthen the small
area estimates and does not require the use of exogenous
variables. The spatial models, by using the appropriate
weight matrix $W$, or a combination of $W$ matrices, can
considerably improve the estimates. The weight matrix
should be based on logical considerations, and it may be
used effectively in cases where, for some reason, reliable
exogenous variables are not available. This aspect can be
further exploited to obtain small area estimates for areas
which have been recently created or demarcated.

One has to be careful about the increase in the MSE due
to the variability caused by replacing the parameters by their
estimates. This is reflected in the second order
approximation to the MSE dealt with in the paper. That is
why the simple spatial model (with intercept) often
performs better than the spatial model involving more
parameters. The use of time series data with regression
parameters fixed across time further improves the small area
estimates, especially for the time points where the direct
survey estimates have larger MSE. Spatial temporal models
have an advantage over temporal models without spatial
consideration due to the inclusion of a fixed spatial
autocorrelation across the small areas. However, for some
time points for which $\rho$ may be very different from the
rest, this may not hold, because the estimates tend towards
the average of the five rounds. Here the temporal
consideration can be started from a suitable initial time
point. Finally, the exogenous variables $X$ and the weight
matrix $W$ supplement each other through the regression
parameter $\beta$ and the autocorrelation parameter $\rho$,
and a judicious use of them
may result in considerable improvement in the small area
estimates.

Appendix


Theorem A.1: Under Regularity Conditions 1

$$\mathrm{MSE}[\hat\theta(\hat\psi)] = g_1(\psi) + g_2(\psi) + g_3(\psi) + o(m^{-1}). \quad (5.1)$$

For the proof of the Theorem, we use the following well known
results (Srivastawa and Tiwari 1976). Let $U \sim N(0, \Sigma)$;
then for the symmetric matrices $A$, $B$ and $C$

$$E[U(U'AU)U'] = \mathrm{Trace}(A\Sigma)\Sigma + 2\Sigma A\Sigma$$

$$E[U(U'AU)(U'BU)U'] = \mathrm{Trace}(A\Sigma)\,\mathrm{Trace}(B\Sigma)\,\Sigma
+ 2[\mathrm{Trace}(A\Sigma)\Sigma B\Sigma + \mathrm{Trace}(B\Sigma)\Sigma A\Sigma + \mathrm{Trace}(A\Sigma B\Sigma)\Sigma]
+ 4[\Sigma A\Sigma B\Sigma + \Sigma B\Sigma A\Sigma].$$
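
As a sanity check on the two moment results above, the following Python sketch evaluates their right-hand sides and compares them, in the scalar case, with the known normal moments $E[U^4] = 3\sigma^4$ and $E[U^6] = 15\sigma^6$. The function names and the numerical values are illustrative assumptions; the check covers only the one-dimensional case and is not a proof.

```python
import numpy as np


def fourth_moment(A, S):
    """Right-hand side of E[U (U'AU) U'] for U ~ N(0, S)."""
    return np.trace(A @ S) * S + 2.0 * (S @ A @ S)


def sixth_moment(A, B, S):
    """Right-hand side of E[U (U'AU)(U'BU) U'] for U ~ N(0, S)."""
    tr_a = np.trace(A @ S)
    tr_b = np.trace(B @ S)
    tr_ab = np.trace(A @ S @ B @ S)
    return (
        tr_a * tr_b * S
        + 2.0 * (tr_a * (S @ B @ S) + tr_b * (S @ A @ S) + tr_ab * S)
        + 4.0 * (S @ A @ S @ B @ S + S @ B @ S @ A @ S)
    )


# Scalar case: U ~ N(0, sigma2), A = [a], B = [b].
sigma2, a, b = 2.0, 3.0, 0.5
S = np.array([[sigma2]])
A = np.array([[a]])
B = np.array([[b]])

# Known moments: E[U^4] = 3 sigma^4 and E[U^6] = 15 sigma^6.
expected_fourth = a * 3.0 * sigma2 ** 2
expected_sixth = a * b * 15.0 * sigma2 ** 3
```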


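Proof of Theorem A.1


Kackar and Harville (1984) showed that
$\mathrm{MSE}[\hat\theta(\hat\psi)] = \mathrm{MSE}[\hat\theta(\psi)] + E[(\hat\theta(\hat\psi) - \hat\theta(\psi))(\hat\theta(\hat\psi) - \hat\theta(\psi))']$.
It is straightforward to show that
$\mathrm{MSE}[\hat\theta(\psi)] = g_1(\psi) + g_2(\psi)$.
We need to prove that
$g_3(\psi) = E[(\hat\theta(\hat\psi) - \hat\theta(\psi))(\hat\theta(\hat\psi) - \hat\theta(\psi))'] + o(m^{-1})$.
By a Taylor series expansion of $\hat\theta(\hat\psi)$ around
$\psi$, and using $(\hat\psi - \psi) = O_p(m^{-1/2})$ and
$(\partial^2 \hat\theta(\psi))/(\partial\psi_d\,\partial\psi_e)|_{\psi = \psi^*} = O_p(1)$
when $\|\psi^* - \psi\| < \|\hat\psi - \psi\|$, we get

$$[\hat\theta(\hat\psi) - \hat\theta(\psi)] = [(\hat\psi - \psi)' \otimes I_m]\,\nabla\hat\theta(\psi) + O_p(m^{-1}). \quad (5.2)$$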


VOC) = (AO(W)) Ay) = [(A6()) /(O), (BBQY)) / 
. Using 


Here 


(do°)]' 


dW) _y» 90°(B,w) 
OW as OB, 

re eB 

where 6°(B,y) = XBQy) + AQY)[y— XB(y)], and the fact 


that (0B, (y)) Oy,) =O,(m™'/*) (Cox and Reid (1987), 
we get from the above 


[8(W) -— 8(y)] =[(W- yy’ @ 1, ]VO (yw) +O, (m™) 


Bw) , 36°(B.w) 


bes OW, OW, bes 


(5.3) 


where V0 (y) = eee ®. Ms ae. bw) fe =6(y) 


00° 
=L(w)y- X Boy).
Using the Regularity Conditions 1 and the fact that
Boy) -B=0,(m™'"*) we have 


[8() — 8Cy)] 
=[(0-w)@1,, 7’ Lop ly — X Bey] + 0, (m™) 
2 A 
=> Gy -VWLiCWLy - X Boy) + O, (m*). 
d= 


Further using the Taylor Series expansion of the Likelihood 
S(f\) =0 around y where 


s(n) =187 (ny, S7 MI", S7(M) = cal 


and the orthogonality of 8 and wy, it follow that 
(h—y) =7,'(w)S,() +O, (m"). 
Writing 


Sy) = ColIS, (W)]=[5,(W), 5.2 Cyl’, 


Say) = of =- ised 5" 
oy 2 


d 


0d 1 
ay | ~ 5 lu’ By (w)u], 


d 


x',u=y—XB(y) and 


B,(w)= Dee = 


d 


oz ple = 
OW 4 Oy, 


I 4.(W) = Arise" 


we get 
[6(@) — Bay) = L’ (Wy Cy) @ 1, IIS, (y) @ uv] 
and thus the expression 


(60) — BQ) ][60) — 6Qy)]’ upto order o(m"') 
=L’ (Wily (y) ®I,,] Col [Sq (y)]Coneatl S, (y)u"] 


[Ty Cy) @ 1, ILCy) 
=U (wy (w)@I,,] ColConcat[uS, (W)S, (yu 


[Ty (y) @1,, ILC). 


Now we can write the likelihood and its derivative as 


(5.4) 


¢=log L=const. > lost 2 |]— swe 


of =~} Trae" = | 5H" B, (wu, 


OW d 
p= 0% ya 
OW 
ae 1 ig; paler) > 
E| - = —Trace| x" x —— |=1,,(w) 
oe 2 OVW, = gil 


where information matrix LyW) = 14.(W).
The expectation of a typical element of the innermost terms
in the expression (5.4) becomes


E{uS,(w)S,(y)u’ |= 
ulu’ B, (wul[u’ B,Qyw)ulu" 


Dy 


< rac" wa, (w)u lu" 


d 
E 
= ulu' B, (yaaa a 


e 


ue fre = zy 
OW, Ow, 


and by applying the results of Srivastawa and Tiwari (1976), 
it becomes 


+uU Teg 


E{uS,Qy)S,(y)u" |= 


Trae" 0% 54 Zh oz > =| 
2 OW, OW, ov,  oW, 


Substituting these in the expression (5.4) and also the 
second expression being of order O(m"'), we can get the 
following upto order o(m™') 


[8(H) — BC) ]L6(H) — BC) 
=L' (wy (y) @1 m] Col Concat[Z,, (y)2] 
[Ty (y) @7,, ILQy) 
=L' (wy (W) @ 1, Wy Cy) ® ZILy' (w) @ 1, ILOy) 
=L' (Wy Cy) ® ZIL(y). 
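
Theorem A.2: Under Regularity Conditions 1

$$E[g_1(\hat\psi) + g_3(\hat\psi) - g_4(\hat\psi) - g_5(\hat\psi)] = g_1(\psi) + o(m^{-1}), \quad (5.5)$$

$$E[g_2(\hat\psi)] = g_2(\psi) + o(m^{-1}),$$

$$E[g_3(\hat\psi)] = g_3(\psi) + o(m^{-1}) \quad (5.6)$$

and

$$E[g_5(\hat\psi)] = g_5(\psi) + o(m^{-1}). \quad (5.7)$$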


Proof of Theorem A.2 


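By a Taylor series expansion of $g_1(\hat\psi)$ around $\psi$, and using
$\hat\psi - \psi = O_p(m^{-1/2})$ when $\|\psi^* - \psi\| \le \|\hat\psi - \psi\|$, we get

$$g_1(\hat\psi) = g_1(\psi) + [(\hat\psi - \psi)' \otimes I_m]\,\nabla g_1(\psi)
+ \frac{1}{2}[(\hat\psi - \psi)' \otimes I_m]\,\nabla^2 g_1(\psi)\,[(\hat\psi - \psi) \otimes I_m]
+ o_p(m^{-1})$$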


Gp 
dg, (yw) dg, (W) 
V ™ 1 1 


0g, (y) 
: . aS WEL 
V?8,(w) = Col cone OW ,oV, 


0g, (y) = RY" Oo sip 


OW, OW 4 
2 
o's. (yw) Sp 33e. 2 yt 0% Sip 
dy OW, OW, oy, 
2 
+Rz> isl acailhess x 
OY OVW, 


Using the fact that $\Sigma(\psi)$ and its derivatives are symmetric,
we have the second term of the expression as


[((-y)’ @1,1V's (WLOh-y) @/,,] 
=-L' (wy (w) ® ZILA) 

07d 
dyoy” 


~ Tie @(RZ")] [Ty (y) ® =n) 


=—23(W) + g5(y) 


where I y (W) =Var(y) is information matrix, the 
asymptotic variance of y. The first term in the expression 
[—w)’ @1,]Ve,(y) reduces to g,(y) because of 
EW -w) =), (yw) up to order o(m') (Peers and Iqbal 
1985). 

The second part of the Theorem follows from the Taylor 
series expansion of g,(), g,;() and g<(Wy), each around 
y and using W-y=O,(m"'"*) and (0’g,(y))/ 
(OW, OV,) lyag =O, (m™), (0° 85,0Y))/ OW OW.) lag = 
O,(m") and (0 gs (y)) OW, ) |,g=O, (mr), 
respectively where Il — wil< Il—wil. 


Theorem A.3: Under Regularity Conditions 2 


MSE((6,(Q)] = 2.5,(W) + 23,0) t+o(m"'). (5.8) 


Proof of Theorem A.3 


The proof follows the lines of the proof of Theorem A.1 and
uses the results of Srivastawa and Tiwari (1976) mentioned
therein.


MSE([6, ()] 
= MSE((6, (y)] + E([8, (y) — 9, )(8, (yw) —9,)" J 


= 215,(W) + EL, (yw) — 8, )(0, (y) —8,)° ]. (5.9)
Taylor series expansion of 8,(y) around y and using


(h-y)=0,(m"'*) and (06(y)) (Ay, dy,) |g =O, () 
when Il —yil<lly—yll we have 
(8, - 8, 0y)] 
=[(h—y) @1,,]' V6,(y) +0, (m") 
3 
=> [WW -V Lawe,(W]+0,(m"). (5.10) 


d=l 


Further using the Taylor series expansion of the Likelihood 
equation S(f\)=0 and the orthogonality of B and y, it 
follows 


(y—w) =1,' Cy) SCy) +O, (m™). 


Substituting the expression for (W—w) in equation (5.10), 
we have up to order o(m') 


(6, (6, CWIEL; (WL CW) @T,, IES, (Y) ® e,] 


and 


(5.11) 


(5.12) 


[6,() - 6, cy) 6,0) - 6, 0W))" I 
=L' (wy (y) @1 n | Col Concat 


[e,S,WS. We; IY CW) @ 1, ILQY) (5.13) 


where 


of 
Sy) = aus . 


ay (y) = Col [S, (yw), 


d 


Using the expression for derivatives of likelihood, we have 


_ oH 
Trace[C,, (w)] — Dye Te | 
= OW a 


1 

Sa (yw) = 2 
di 

+> le: Ba Wwe] 
t= 


oa: ) ¢ 4 * On. 
+ 618; Joes ot ax) Rikon Hix; 
d 


d 
Bi W=He Oli en 
OW 


By applying the considerations e, ~ N(O,H,), Corr(e,, 
e,)=0 for i# j, Corr(e,, (de;)/(Oy,))=0 and Corr(e,, 
a) GEN ))=0 due to the fact that (de,) /(dw,) = 
(O(y, -U,G,,_,))(Ow,) being linear function of 
(Yi, Yo>---» ¥;4) iS uncorrelated with e,, we get the 
expectation of the inner most terms of the expression (5.13) 
as
Ele,S, (WS, WE; ]=K,.(W)H, + He oH


OW OW, 


+ ih Trace[B,,(y)] on, + Trace[B,,(w)] alt 
o, OV, OW a 


f = Trace [B,, (W)] Trace[B,, (W)JH, 


where 


gy OL, ay OLN 
K,.w) == £3 Tace in He 


d 


The middle three terms in this expression being of order
O(1) which along with J, Ww) in the expression given 
below makes them of order ris Lak 


EL(8, () — 8, (y))(8, CH) - 8, (Y))"1= 85, 0W) 
=L' (wy (y) @ 1, [K, (yw) ® H,] 
[Zy' Cy) @ 1, ILQy) + o(m") 

=L' (wy (WK, (wl, (w) ® H, JLQy) + o(m™). 
Theorem A.4: Under Regularity Conditions 2 

EL 8 10,0) + 83,0W) + 831,W) — 84H) - 85,0)] 

= 81, (W) +o(m™) 

Elg3,(W)] = g3,(y) + o(m™) 

and 
Elgs,(Q)1= 85,(w)+o(m”). 

Proof of Theorem A.4 


The proof is essentially based on the line suggested in 
proving Theorem A.2. Using Taylor series expansion of 
21>,() around y, we get 


81, W) = 8, W) +1 —-y) @ BL Vou (y) 
tae % 
+S[e— wy @1,1V 81, WIp—y) @1,,] 
+0,(m") 


O81, (W) 
) 


d 


Ve7,(W) = Col [V 8:00 OW), Voi OW) = 


8121 CW) 
Vv’ 8121 (W)= ar Col Conca oy ,oW, Feut 
988i (W) _ py OR yap 
Ow, OW, 

Fa _ ppt 2 51 2 sag 
OW OW, Wa : 

+ REO 9X vip 

OW ,OW,
Using the fact that $\Sigma(\psi)$ and its derivatives are symmetric,
we have the second term of the expression as


lp —w)’ @1,,1V" 2, Wleh—w) @7,,] 
= -L'(wlly (Wy) @ZIL(y) 
0°y 


yay lv (y) @(2"R)] 


+ 5 Trace [1, ®(RZ")] 


= —83,(W) = 85, (W) 


where J,'(y)=Var(y) is the asymptotic variance of vw. 
The first term in the expression [(/—w)’ ® Tin WV 819; W) 
reduces to g,,(W) because of E(y—w)= b,(y) up to 
order o(m™) (Peers and Iqbal 1985). 

The second part of the Theorem follows from the Taylor 
series expansion of g,,() and g.,(w), each around wy 
and using W-y=O,(m"*) and  (0°g,,(y))/ 
(Wg A.) lyag = Op(m"') and (0g, (W))/ 
(OW ,0W,) lyg= O,,(m™), respectively where Il — (ils 
Iw-—wll.


Modeling and Estimation Methods for Household Size in the Presence of
Nonignorable Nonresponse Applied to the Norwegian Consumer
Expenditure Survey


Liv Belsby, Jan Bjørnstad and Li-Chun Zhang ¹


Abstract 


This paper considers the problem of estimating, in the presence of considerable nonignorable nonresponse, the number of 
private households of various sizes and the total number of households in Norway. The approach is model-based with a 
population model for household size given registered family size. We account for possible nonresponse biases by modeling 
the response mechanism conditional on household size. Various models are evaluated together with a maximum likelihood 
estimator and imputation-based poststratification. Comparisons are made with pure poststratification using registered family 
size as stratifier and estimation methods used in official statistics for the Norwegian Consumer Expenditure Survey. The
study indicates that response modeling, poststratification and imputation are important ingredients of a satisfactory modeling approach.


Key Words: Household size; Nonresponse; Imputation; Poststratification. 


1. Introduction 


This work is motivated by the considerable nonresponse 
rate in the Norwegian Consumer Expenditure Surveys 
(CES) for private households, for example 32% in the 1992 
survey. Nonresponse involves both noncontact and refusal. 
We focus on the problem of nonignorable nonresponse that 
occurs when estimating the number of households of 
various sizes and the total number of households. 

We shall consider a completely model-based approach:
modeling and estimating the distribution of household size
given registered family size, and the response mechanism
conditional on the household size. This model takes into
account that the nonresponse mechanism may be
nonignorable, in the sense that the probability of response is
allowed to depend on the size of the household. The
response model is used to correct for nonresponse.
Model-based approaches with nonresponse included,
sometimes called the prediction approach, have been
considered by, among others, Little (1982), Greenlees,
Reece and Zieschang (1982), Baker and Laird (1988),
Bjørnstad and Walsøe (1991), Bjørnstad and Skjold (1992)
and Forster and Smith (1998).

For various models of household size and response we 
consider mainly two model-based approaches, a maximum 
likelihood estimator and imputation-based poststratification 
after registered family size. These methods are compared to 
pure poststratification and the methods in current use in 
CES. 


The main issue here is a comparison of models and 
methods with estimation bias as the basic problem. In 
addition, standard errors of the estimates and differences of 
the estimates, conditional on the sizes of post-strata 
determined by family size, are estimated using a bootstrap 
approach. In addition to assessing the statistical uncertainty 
of the estimators, this is done to help evaluate the extent to 
which differences between the proposed estimators are 
attributable to sampling error, nonresponse bias or both. 
However, in this evaluation we keep in mind the following 
quote from Little and Rubin (1987, page 67): “It is impor- 
tant to emphasize that in many applications the issue of 
nonresponse bias is often more crucial than that of variance. 
In fact, it has been argued that providing a valid estimate of 
sampling variance is worse than providing no estimate if the 
estimator has a large bias, which dominates the mean 
squared error.” 

Section 2 describes the data-structure and the sample 
design of CES, and Section 3 considers modeling issues. 
Section 3.1 presents the various models for household size 
and response to be considered for the 1992 CES, Section 3.2 
describes the maximum likelihood method for parameter 
estimation, and in Section 3.3 the models are evaluated. A 
family size group model for household size and a logistic 
link for the response probability using household size as a 
categorical variable give the best fit of the models under 
consideration. Section 3.4 gives the estimated household 
size distributions for different family sizes and estimated 
response probabilities for different household sizes. 


1. Liv Belsby, Statistics Norway, Division of Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo. E-mail: Ibe@ssb.no; Jan F. Bjørnstad,

Section 4 considers model-based estimation, the
imputation method, imputation-based estimators and the
variance estimation method. It is shown that for the chosen
model for household size from Section 3.3, the maximum
likelihood estimator and the imputation-based poststratified
estimator are identical.

Section 5 deals with the main goal of estimating the total
number of households of various sizes based on the 1992
CES, using the estimators in Section 4. The model that gave
the best fit seems to work well for our estimation problem.
We conclude that poststratification, response modeling and
imputation are key ingredients for a satisfactory approach.


2. Norwegian Consumer Expenditure Survey 


The population totals within household-size categories
provide a more correct number of dwellings than the totals
within family-size categories from the Norwegian Family
Register. Furthermore, the authorities use the estimated
number of households when evaluating possible policy
interventions aimed at housing construction. Estimating
household-size totals is therefore an important issue in
social planning. It is invariably affected by nonignorable
nonresponse, no matter what kind of survey one uses.
Hence, it is a good illustration of how to handle
nonresponse bias. We shall base our estimation on the
Norwegian Consumer Expenditure Surveys (CES), where it
is important to gain information about the composition of
households, since household size influences consumption.

The actual CES, the survey for expenditure variables, is a
sample of private households from all private households in
Norway. The sample is drawn by selecting a sample of
persons and including the whole households to which these
persons belong. Persons older than 80 years are excluded
since they often live in institutions. For our purpose, the
units of interest in the survey are persons between the ages
of 16 and 80 living in private households, and the variable of
interest is the size of the household the person belongs to,
which is observed only in the response sample of the
persons selected.

The sample design is a three-stage self-weighting sample
of persons. That is, every person in the population has the
same probability of inclusion in the total sample. The first
two stages select geographical areas in a stratified way,
while at the third stage persons are selected randomly from
the chosen geographical areas. The primary sampling units
(PSUs) at stage 1 consist of the municipalities in Norway.
Municipalities with fewer than 3,000 inhabitants are grouped
together such that each PSU consists of at least 3,000
persons. The PSUs are first grouped into 10 regions and,
within each region, stratified according to size (number of
inhabitants) and type of municipality (i.e., industrial
structure and centrality). In total, there are 102 strata. Towns
of more than 30,000 inhabitants are their own strata and are
therefore selected with certainty at stage 1. For the other
strata, one PSU is selected with probability proportional to
size. At the second stage, the selected PSUs are divided into
three smaller areas (secondary sampling units, SSUs) and
one of these is selected at random. Finally, at the third stage,
a random sample of persons is selected for each of the
selected SSUs. The sample sizes for each selected SSU are
determined such that the resulting total sample of persons is
self-weighting.

Our application is based on the data from the 1992 CES.
CES is a yearly survey, and since 1992 a modified
Horvitz-Thompson estimator, including a correction for
nonresponse by estimating response probabilities given
household size, has been employed (see Belsby 1995). The
weights equal the inverse of the product of the probability
of being selected and the conditional probability of
response given selection. Since 1993 the probability of
response has been estimated with a logistic model, the
auxiliary variables being place of residence (rural/urban)
and household size. For most of the nonrespondents the
family size is used as a substitute for the
household size.
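
The weighting rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the probabilities in the example are made up, and the response probability is taken as given rather than estimated from a logistic model.

```python
import math


def nonresponse_adjusted_weight(p_select, p_respond):
    """Inverse of (selection probability x response probability given selection)."""
    if not (0.0 < p_select <= 1.0 and 0.0 < p_respond <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return 1.0 / (p_select * p_respond)


def weighted_total(values, weights):
    """Horvitz-Thompson style total over the responding units."""
    return sum(v * w for v, w in zip(values, weights))


# Made-up example: selection probability 0.001, response probability 0.8.
w = nonresponse_adjusted_weight(0.001, 0.8)
total = weighted_total([1, 1], [w, w])
```

A lower response probability gives a larger weight, so respondents from groups that respond less often stand in for more nonrespondents.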

A household is defined as persons having a common
dwelling and sharing at least one meal each day (having
common board). For a complete description of CES we
refer to Statistics Norway (1996). In CES, the auxiliary
variables known for the total sample, including the
nonrespondents, are the family size, the time of the survey
(summer/not summer), and the place of residence
(urban/rural). Families are registered in the Norwegian
Family Register (NFR), and may differ from the household
the persons in the family belong to, both by definition and
because of changes not yet registered. Hence, the registered
family size from NFR differs to some extent from the
household size. Initially, based on experience from previous
surveys, all the auxiliary variables and household size are
assumed to affect the response rate.

Table 1 shows the data for the 1992 CES with a total 
sample of 1,698 persons. The households with size five and 
greater are collapsed due to the low frequency in the sample 
of households. We base our modeling and estimation on two 
corresponding tables, one for the persons in rural areas and 
one for the persons in urban areas. These data are given in 
table Al in appendix A1. 

For example, the number 48 in cell (1,2) means that of
the 162 persons registered to live alone in the response
sample, 48 are actually living in a two-person household.
This is explained mostly by young people's tendency to
cohabit without being married; see Keilman and
Brunborg (1995).

Table 1
Family and household sizes for the 1992 Norwegian Consumer Expenditure Survey

                     Household size
Family size       1      2      3      4
1                83     48     20      9
2                 9    177     37      4
3                10     25    131     40
4                 2     13     37    231
5 or more         1      4      4     17
Total           105    267    229    301
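The counts in Table 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are ours, and the cell counts (including the "5 or more" column, the totals and the nonresponse counts) are entered as we read them from Table 1.

```python
# Respondent counts as read from Table 1: rows are registered family sizes
# 1..5+, columns are household sizes 1..5+ (last column is "5 or more").
respondents = [
    [83, 48, 20, 9, 2],
    [9, 177, 37, 4, 3],
    [10, 25, 131, 40, 6],
    [2, 13, 37, 231, 17],
    [1, 4, 4, 17, 181],
]
# Nonrespondents per registered family size (household size unobserved).
nonrespondents = [153, 160, 91, 123, 60]


def response_rates(resp, nonresp):
    """Response rate per family size: respondents / (respondents + nonrespondents)."""
    return [sum(row) / (sum(row) + miss) for row, miss in zip(resp, nonresp)]


rates = response_rates(respondents, nonrespondents)
n_resp = sum(sum(row) for row in respondents)
overall_rate = n_resp / (n_resp + sum(nonrespondents))

# Share of responding one-person families found in two-person households.
share_cell_1_2 = respondents[0][1] / sum(respondents[0])
```

The response rate rises with registered family size, which is the pattern that motivates modeling the response mechanism in Section 3.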


3. Modeling of Household Size and Nonresponse 


We shall assume a population model for the household 
size, given auxiliary variables, i.e., we model the conditional 
probability. To take nonresponse into account in the statis- 
tical analysis, we must model the response mechanism, i.e., 
the distribution of response conditional on the household 
size and auxiliary variables. The sampling mechanism for 
persons is ignorable for the survey we consider, i.e., is 
independent of the population vector of household sizes. 
The statistical analysis is therefore done conditional on the 
total sample, following the likelihood principle (see 
Bjørnstad 1996). Hence, probability considerations based on
the sampling design are irrelevant in the statistical analysis.
This is the so-called prediction approach. However, when 
evaluating the estimation methods with regard to statistical 
uncertainty, we do this from a common randomization per- 
spective as described in Section 4.3. 

For CES, the auxiliary vector consists of the family size, 
place of residence divided into rural and urban areas, and 
time of the data collection. 


3.1 The Models 


Let us first consider a simple model for the household
size, denoted by Y. Let x denote all auxiliary variables. The
household size is assumed to depend only on the family size
$x_1$, and as such is a model with a restricted parametric link
function, but with no additional assumptions,

$$P(Y_i = y \mid x_i) = P(Y_i = y \mid x_{1i}) = p_{y, x_{1i}}, \tag{3.1}$$

where

$$\sum_{y} p_{y, x_{1i}} = 1, \quad \text{for each possible value of } x_{1i}.$$


The model (3.1) is flexible in the sense that it does not
include any restrictions on the assumed model function of
$x_1$. The drawback is the high number of parameters
compared with a logistic-type model whose link function
(the function linking $P(Y = y)$ with x) is linear in x.
If nonresponse is ignored, the estimates in this
model are simply the observed rates.


Table 1 (continued)

               Household size
Family size      5 or more     Total   Nonresponse   Response rate
1                        2       162           153           0.514
2                        3       230           160           0.590
3                        6       212            91           0.700
4                       17       300           123           0.709
5 or more              181       207            60           0.775
Total                  209     1,111           587           0.654


Household size defines ordered categories. Thus a natural
choice for a model is the cumulative logit model, known as
the proportional-odds model (see McCullagh and Nelder
1991), assuming (with $\theta_y$ increasing in y)

$$P(Y_i \le y \mid x_i) = \begin{cases} \dfrac{1}{1 + \exp(-\theta_y + \beta' x_i)} & \text{for } y = 1, 2, 3, 4, \\[1ex] 1 & \text{for } y \ge 5. \end{cases}$$


However, a goodness of fit test, with x consisting of 
family size and place of residence, indicated that this model 
fits the data badly. Thus we choose to reject it. 
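To make the cumulative logit model concrete, the following sketch turns cut-points and a linear predictor into category probabilities. It is an illustration only: the function name and the parameter values are hypothetical and are not estimates from the CES data.

```python
import math


def cumulative_logit_probs(thetas, beta, x):
    """Category probabilities P(Y = y | x), y = 1..len(thetas)+1, under
    P(Y <= y | x) = 1 / (1 + exp(-theta_y + beta'x)), with thetas increasing."""
    eta = sum(b * v for b, v in zip(beta, x))
    cumulative = [1.0 / (1.0 + math.exp(-t + eta)) for t in thetas] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# Hypothetical parameter values, for illustration only.
thetas = [-1.0, 0.0, 1.0, 2.0]      # one cut-point for each of y = 1, 2, 3, 4
beta = [0.8]                        # single covariate: family size
probs_family_1 = cumulative_logit_probs(thetas, beta, [1])
probs_family_4 = cumulative_logit_probs(thetas, beta, [4])
```

With a positive coefficient, a larger family size shifts probability mass toward larger household sizes, while the same cut-points are shared by all covariate values; that shared structure is the restriction the goodness of fit test rejects.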

It is assumed that the probability of nonresponse may
depend on the household size. For example, one-person
households are less likely to respond than households of
larger size since larger households are easier to "find at
home". Nonresponse is indicated by the variable R, where
$R_i = 1$ if person i responds and 0 otherwise. Let $R_s$ be the
vector of these indicators in the total sample. From
Bjørnstad (1996), the response mechanism (RM), i.e., the
conditional distribution of $R_s$ given the x-values in the
population and the y-values in the total sample, is defined
to be ignorable if it can be discarded in a likelihood-based
analysis. This means that RM is ignorable if this conditional
distribution of $R_s$ does not depend on the unobserved
y-values, coinciding with the definition used by Little and
Rubin (1987, pages 90, 218). For our case it is assumed that
all pairs $(Y_i, R_i)$ are independent. Then RM is ignorable if
$Y_i$ and $R_i$ are independent. Hence, a nonignorable response
mechanism is equivalent to

$$P(Y_i = y_i \mid x_i, R_i = 0) \ne P(Y_i = y_i \mid x_i, R_i = 1),$$

and then both are different from $P(Y_i = y_i \mid x_i)$.


Thus estimating the parameters in the model for
$P(Y = y \mid x)$ using only the response sample, ignoring that the
probability of response depends on the household size, would
most likely give biased estimates for the unknown parameters.
Also the poststratification estimator would give
biased estimates because it assumes that the distribution of R
only depends on the auxiliary x. For example, the observed lower
response rate among one-person families indicates that the
same may hold for one-person households. If so, the estimated
probability of household size 1, based on respondents
only, would be too small. Poststratification with respect to
family size will most likely correct only some of this bias.

The model for the probability of response, given
auxiliary variables and household size $y_i$, is assumed to be
logistic. It depends on the auxiliary variables $z_i$, which
includes part of $x_i$, expressed by

$$\text{RM1}(y, z): \quad P(R_i = 1 \mid y_i, z_i) = \frac{1}{1 + \exp(-\alpha - \gamma y_i - \psi' z_i)}. \tag{3.2}$$

Here, $\alpha$ and $\gamma$ are scalar parameters and $\psi$ is a vector.
The variable $y_i$ has an order. Motivated by this fact, and to
avoid introducing many parameters, $y_i$ is used in (3.2) as
an ordinal variable rather than a class variable. Thus the
logit function,

$$\log\{P(R_i = 1 \mid y_i, z_i) / P(R_i = 0 \mid y_i, z_i)\} = \alpha + \gamma y_i + \psi' z_i,$$

is linear in $y_i$. To avoid the assumption of linear logit in
$y_i$, we also consider a model with $y_i$ as a categorical
variable, i.e.,

$$\text{RM2}(y, z): \quad P(R_i = 1 \mid y_i, z_i) = \frac{1}{1 + \exp\{-\alpha - \gamma_1 I_1(y_i) - \gamma_2 I_2(y_i) - \gamma_3 I_3(y_i) - \gamma_4 I_4(y_i) - \psi' z_i\}}, \tag{3.3}$$

where the indicator variable $I_y(y_i)$ equals 1 if $y_i = y$ and
0 otherwise. The drawback with this model is that it
includes three parameters more than model (3.2).
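The two response models can be written as small functions. This is a sketch under stated assumptions: z is taken to be a single 0/1 indicator (rural/urban), household size 5 or more is treated as the reference category in RM2, and all parameter values are hypothetical rather than fitted.

```python
import math


def rm1(y, z, alpha, gamma, psi):
    """RM1: response probability with household size y entering linearly."""
    return 1.0 / (1.0 + math.exp(-alpha - gamma * y - psi * z))


def rm2(y, z, alpha, gammas, psi):
    """RM2: response probability with one effect per household size category.
    Sizes missing from `gammas` (here 5 or more) act as the reference."""
    return 1.0 / (1.0 + math.exp(-alpha - gammas.get(y, 0.0) - psi * z))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical parameter values, for illustration only.
alpha, gamma, psi = -0.5, 0.3, -0.4
gammas = {1: -1.2, 2: -0.7, 3: 0.1, 4: -0.2}

# Urban-minus-rural difference on the logit scale for each household size.
shift_rm1 = [logit(rm1(y, 1, alpha, gamma, psi)) - logit(rm1(y, 0, alpha, gamma, psi))
             for y in range(1, 6)]
shift_rm2 = [logit(rm2(y, 1, alpha, gammas, psi)) - logit(rm2(y, 0, alpha, gammas, psi))
             for y in range(1, 6)]
```

In both models the urban and rural logits differ by the same constant for every household size (parallel logits); RM2 only relaxes the linearity in y.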


3.2 Maximum Likelihood Parameter Estimation 


All the selected persons in the sample are from different
households (duplicates have been removed). The population
model then assumes that the household sizes $Y_i$ are
statistically independent. For this variable, interviewer or
cluster effects play no role.

Let us consider the likelihood function for estimating the
unknown parameters, assuming that all pairs $(Y_i, R_i)$ are
independent and response model RM1 given by (3.2). To
simplify notation we relabel the observations such that
observations 1 to $n_r$ are the respondents and observations
$n_r + 1$ to n are the nonrespondents. With response model
RM2 the expression for the likelihood is of the same form
with (3.3) replacing (3.2).

For the respondents let $L_i = P(Y_i = y_i \cap R_i = 1 \mid x_i)$.
Then, for model (3.1),

$$L_i = \frac{1}{1 + \exp(-\alpha - \gamma y_i - \psi' z_i)} \, p_{y_i, x_{1i}}. \tag{3.4}$$

For the nonrespondents let $L_i = P(R_i = 0 \mid x_i)$. Then

$$L_i = \sum_{y=1}^{5} \frac{1}{1 + \exp(\alpha + \gamma y + \psi' z_i)} \, p_{y, x_{1i}}, \quad i = n_r + 1, \ldots, n. \tag{3.5}$$

The likelihood function for the entire sample of persons
from different households is given by

$$L(\theta, \beta, \alpha, \gamma, \psi) = \prod_{i=1}^{n} L_i. \tag{3.6}$$

For $i = 1, \ldots, n_r$, $L_i$ is according to (3.4) and for
$i = n_r + 1, \ldots, n$, $L_i$ is given by (3.5).
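The likelihood (3.4) to (3.6) translates directly into a log-likelihood function. The sketch below is not the TSP code used for the paper; it assumes a scalar 0/1 covariate z, stores the probabilities of model (3.1) in a nested dictionary `p[x][y]`, and uses a toy data set with two household sizes purely for illustration.

```python
import math


def log_likelihood(p, alpha, gamma, psi, respondents, nonrespondents):
    """Log of (3.6) under model (3.1) with response model RM1.

    p[x][y]        : P(Y = y | family size x)
    respondents    : list of (y, x, z) triples
    nonrespondents : list of (x, z) pairs, household size unobserved
    """
    def response_prob(y, z):
        return 1.0 / (1.0 + math.exp(-alpha - gamma * y - psi * z))

    total = 0.0
    for y, x, z in respondents:                      # contributions (3.4)
        total += math.log(response_prob(y, z) * p[x][y])
    for x, z in nonrespondents:                      # contributions (3.5)
        total += math.log(sum((1.0 - response_prob(y, z)) * p[x][y] for y in p[x]))
    return total


# Toy data with two household sizes, for illustration only.
p_toy = {1: {1: 0.6, 2: 0.4}, 2: {1: 0.1, 2: 0.9}}
resp_toy = [(1, 1, 0), (2, 2, 1)]
nonresp_toy = [(1, 0)]
ll_ignorable = log_likelihood(p_toy, 0.0, 0.0, 0.0, resp_toy, nonresp_toy)
ll_nonignorable = log_likelihood(p_toy, 0.0, 0.5, 0.0, resp_toy, nonresp_toy)
```

When gamma is zero the response probability no longer depends on y, and the log-likelihood separates into a response part and a household size part; this is the ignorable case discussed in Section 3.1.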

Estimates are found by maximizing the likelihood
function (3.6). The maximization was done numerically
using the software TSP (1991); see Hall, Cummins and
Schnake (1991). The optimizing algorithm is a standard
gradient method, using the analytical first and second
derivatives. These are obtained by the program, saving us a
substantial piece of programming. The model fitting is
based on the chi-square statistic and on the t-values
provided by TSP, where the standard errors are derived
from the analytical second derivatives. The t-values have
to be interpreted with some care, since the unbiasedness of
the estimated standard errors depends on how well the
model is specified as well as the number of observations
compared with the number of parameters.


3.3 Evaluation of the Models for Household Size and 
Response 


We present the fit of the models with the Pearson
goodness-of-fit statistics. The model study is based on the
1992 CES. The parameters are considered to be significant
when the absolute t-values are greater than 2. However,
we do not want a model that is too restrictive, and therefore
some variables are kept even though their absolute
t-values are less than 2.

In the response models RM1 and RM2 we use the 
variable z =z, place of residence. We let z =0 if rural area 
and z = 1 if urban area. It was observed in the CES
1986-88 and CES 1992-94, see Statistics Norway (1990,
1996), that there is more nonresponse during the summer.
Therefore, the time of the survey was also included in the
model, that is, whether or not the data were collected in the
period May 21 to August 12. However, the time of the
survey was found to be nonsignificant, with a t-value
clearly less than 2. Also the family size was found to be
nonsignificant. But if the household size is omitted in the
response model then the family size turns out to be
significant.

Ideally, we want to take a look at the empirical logit
function for response with respect to the household size.
However, household size is unavailable for the
nonrespondents. As a replacement we plot the logit function
against the family size; see figure 1. From family size one to
two the two functions for rural and urban families increase
in a fairly parallel way. However, for family size three and
four the logit functions depart from being linear and parallel.
Thus we suspect that coding the household size as a
categorical variable, as in model RM2, will give better fit
than restricting the logit functions to be parallel for rural and
urban and linear with respect to the household size, as in
model RM1.

In order to test the goodness of fit of the models, we
consider the Pearson chi-square statistic, conditional on the
auxiliary variables x, z. Given rural or urban type of
residence and registered family size, there are six possible
outcomes: household sizes 1, ..., 5 and nonresponse.
Altogether there are ten multinomial trials and sixty cells.
For family sizes (1,2) and (4,5), the extreme household sizes
(4,5) and (1,2), respectively, are combined because the
expected sizes under the models are too small. This reduces
the number of cells to 52. The degrees of freedom (d.f.) are
calculated as: number of cells - number of trials - number
of parameters. For model (3.1) & RM1(y, z), d.f. = 52 -
10 - (20 + 3) = 19, and for (3.1) & RM2(y, z), d.f. =
52 - 10 - (20 + 6) = 16. For model (3.1) & RM1(y, z)
the Pearson statistic $\chi^2$ is 26.35 and the p-value is 0.121.
For model (3.1) & RM2(y, z), $\chi^2$ is 21.77 and the
p-value is 0.151.
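The degrees of freedom and p-values quoted above can be recomputed with the standard library alone. The helper `chi2_sf` below is our own illustrative implementation of the chi-square upper tail for integer degrees of freedom (using the recurrence of the regularized incomplete gamma function); it is not part of the software used in the paper.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) for a chi-square variable with integer df >= 1."""
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)                 # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))      # closed form for df = 1
    # Recurrence: Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def degrees_of_freedom(cells, trials, parameters):
    return cells - trials - parameters


cells = 60 - 8                                    # eight pairs of cells combined
df_rm1 = degrees_of_freedom(cells, 10, 20 + 3)    # model (3.1) & RM1
df_rm2 = degrees_of_freedom(cells, 10, 20 + 6)    # model (3.1) & RM2
p_rm1 = chi2_sf(26.35, df_rm1)
p_rm2 = chi2_sf(21.77, df_rm2)
```

The count of 20 population parameters corresponds to four free probabilities for each of the five family sizes in model (3.1).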

By studying the standardized residuals, (observed -
expected)/$\sqrt{\text{Var(observed)}}$, we find that the main reason
for the better fit is that model (3.1) & RM2(y, z) does a
better job of predicting the observed counts for the urban
area where the response rate is lowest (see appendix A1).
Thus the data indicate that coding the household size as a
categorical variable, as in RM2, improves the fit compared to
using it as an ordinal variable. The model (3.1), with the
restricted parametric link function, combined with RM2 is
the best of the models we have considered so far.


3.4 Estimated Household Size Distribution and 
Response Probabilities 


Table 2 displays the estimates for the population model 
(3.1) together with the logistic response model RM2 in 
(3.3). 


Figure 1. The logit function for the empirical response rate with respect to family size 1, ..., 5 in urban and rural areas,
respectively. The computation is based on respondents and nonrespondents from Table 1 in Appendix A1.
(Vertical axis: log(response rate/nonresponse rate); horizontal axis: family size.)


Table 2
1992 CES. Parameter Estimates, in Percentages, for the Population Model with a Restricted Parametric
Link Function, $p_{y,x}$, Combined with the Logistic Response Model RM2(y, z). In Parentheses
are the Estimates for the Population Model, Ignoring the Response Mechanism

                                          Household size
Family size, x   1               2               3               4               5 or more
1                60.01 (51.23)   26.75 (29.63)    8.35 (12.35)    4.09 (5.56)     0.80 (1.23)
2                 5.27 (3.91)    79.80 (76.98)   12.48 (16.09)    1.47 (1.74)     0.98 (1.30)
3                 7.53 (4.72)    14.45 (11.79)   56.67 (61.79)   18.85 (18.87)    2.50 (2.83)
4                 1.06 (0.67)     5.31 (4.33)    11.38 (12.33)   77.20 (77.00)    5.05 (5.67)
5 or more         0.84 (0.48)     2.60 (1.93)     1.96 (1.93)     9.05 (8.21)    85.55 (87.44)

Let us interpret some of the values in the household
model. Taking the response mechanism into account has
the largest effect on the estimated household distribution for
one-person families. The probability that a household size
equals one, given that the family size is one, is estimated as 
60.01%. The estimate based on the traditional approach, 
ignoring the nonresponse, is 51.23%. The response model 
“adjusts” the observed rate among the respondents to a 
higher value. This seems reasonable since the rate of non- 
respondents is higher for small households. The estimated 
probability of household size five or more, given family size 
of five or more is 85.55%, which differs little from the 
observed rate among the respondents, 87.44%. This 
indicates that, given family size five or more, the household 
size distribution is about the same among respondents and 
nonrespondents. 

Table 3 presents the estimated response probabilities 
based on RM2 in combination with the population model 
(3.1). Furthermore, we present estimated response proba- 
bilities based on a saturated model, with perfect fit, 
presented in Section 4.2. The model, defined by (4.9), 
assumes that the response probability for persons with the 
same household size within rural/urban area, respectively, is 
identical for different family sizes. Moreover, the model for 
household size depends on place of residence and family 
size, but with no restriction on the link function. We note 
that RM2(y, z) satisfies (4.9b), but is more restrictive. 
Model (4.9) allows for more freedom than model (3.1) with
RM2(y, z).

Table 3
Estimated Probability of Response Based on the Logistic
Model RM2 in Combination with (3.1), and the Saturated
Model (4.9). The Estimates are Given in Percentages

                                   Household size
Place of residence   1             2             3             4             5 or more

Estimated response probabilities for model RM2
Rural                [illegible]   60.90         [illegible]   [illegible]   [illegible]
Urban                38.92         52.04         72.44         65.62         75.46

Estimated response probabilities for the saturated model
Rural                50.79         62.37         [illegible]   [illegible]   83.07
Urban                [illegible]   50.85         [illegible]   [illegible]   [illegible]


The estimated response probabilities reflect the lower
response rate among one-person households, and the lower
response rate in urban areas. Households of size five and
higher have the highest response rate. The models estimate,
perhaps surprisingly, that the probability of response is
higher for households of size three than for households of
size four. This may be explained by the fact that women
often choose to have two children, and that three-person
households mostly consist of mother, father and a small
child. Such a family will tend to stay at home and thus be
more accessible than a typical four-person family with two
older children.

The higher estimated response rate for households of size
three compared to size four is equivalent to the ratio
$P(Y = 3 \mid R = 1)/P(Y = 3 \mid R = 0)$ being greater than the ratio
$P(Y = 4 \mid R = 1)/P(Y = 4 \mid R = 0)$. This is consistent with
the household distribution in table 2, where we estimate that
$P(Y = 4) \approx P(Y = 4 \mid R = 1)$, i.e., $P(Y = 4 \mid R = 0) \approx P(Y = 4 \mid R = 1)$.
On the other hand, the estimates in table 2 indicate
that $P(Y = 3 \mid R = 1) > P(Y = 3)$, which means that
$P(Y = 3 \mid R = 1) > P(Y = 3 \mid R = 0)$.

We see that the logistic model RM2 combined with the
population model with the restricted parametric link $p_{y,x}$
acts as a smoother of the estimates based on the saturated
model in (4.9), because of the added assumption of parallel
logits of the response probabilities for urban and rural areas.


4. Estimators for Household Size Totals 


In this section we present the estimators for household 
size totals and the method for variance estimation. We use a 
maximum likelihood estimator with the restricted para- 
metric link function in (3.1) as population model. It is 
shown that this estimator is identical to an imputation-based 
poststratified estimator, which again turns out as a standard 
poststratification when the response mechanism is ignored. 
Furthermore, we present an imputed poststratified estimator, 
based on a saturated model for household size and response 
probability. 


4.1 Estimators Based on a Restricted Parametric 
Link Function as Population Model 


With $N_y$ denoting the total number of persons living in
households of size y, the number of households of size y
equals $H_y = N_y / y$. The total number of households is
denoted by H, $H = \sum_{y=1}^{J} H_y$.

The statistical problem is to estimate $H_y$ for
$y = 1, \ldots, J$ and H. The largest size J is chosen such that
there are few households of size greater than J. Strictly
speaking, $H_J$ is the number of households of size J or
more, and likewise for $N_J$. In our application we choose
J = 5 due to the low frequency in the sample of households
of size greater than five. We can write $N_y = \sum_{i=1}^{N} I(Y_i = y)$,
where the indicator function $I(Y_i = y) = 1$ if $Y_i = y$, and 0
otherwise. Hence, with $x = (x_1, \ldots, x_N)$,

$$E(H_y \mid x) = \frac{1}{y} \sum_{i=1}^{N} P(Y_i = y \mid x_i).$$

A maximum likelihood based estimator for $H_y$ can be
obtained by estimating $E(H_y \mid x)$, i.e., replacing
$P(Y_i = y \mid x_i)$ by the maximum likelihood estimator
$\hat{P}(Y_i = y \mid x_i)$. The data are stratified according to family
sizes 1, ..., K, where the last category contains persons
belonging to families of sizes $\ge K$. Using the model with
the restricted parametric link function, defined in (3.1), Y is
assumed to depend only on the family size x, and the
estimator takes the form

$$\hat{H}_y = \frac{1}{y} \sum_{x=1}^{K} M_x \hat{p}_{y,x}, \tag{4.1}$$

where $M_x$ ($M_K$) denotes the number of persons in the
population with registered family size x ($\ge K$). The $M_x$'s
are known auxiliary information from the Norwegian
Family Register.
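As a worked illustration of (4.1), the sketch below combines the person counts by family size from table 4 with the estimated probabilities from table 2 (model (3.1) with RM2), entered as we read them. It assumes the divisor 5.25 for the open-ended category, as described in Section 5, and makes no claim to reproduce the published estimates exactly, since the table entries are rounded.

```python
# Persons by registered family size (1, 2, 3, 4, 5 or more), from table 4.
M = [793_869, 816_880, 784_581, 1_066_016, 670_528]

# Estimated P(Y = y | x) in percent, rows x = 1..5+, columns y = 1..5+,
# as read from table 2 (values outside the parentheses).
table2_percent = [
    [60.01, 26.75, 8.35, 4.09, 0.80],
    [5.27, 79.80, 12.48, 1.47, 0.98],
    [7.53, 14.45, 56.67, 18.85, 2.50],
    [1.06, 5.31, 11.38, 77.20, 5.05],
    [0.84, 2.60, 1.96, 9.05, 85.55],
]
p_hat = [[v / 100.0 for v in row] for row in table2_percent]

# Household size used as divisor; 5.25 stands in for "5 or more".
divisors = [1, 2, 3, 4, 5.25]


def household_totals(M, p_hat, divisors):
    """Estimator (4.1): H_y = (1/y) * sum_x M_x * p_hat[x][y]."""
    return [
        sum(M[x] * p_hat[x][y] for x in range(len(M))) / divisors[y]
        for y in range(len(divisors))
    ]


H_ml = household_totals(M, p_hat, divisors)
persons_implied = sum(d * h for d, h in zip(divisors, H_ml))
```

Because each row of `p_hat` sums to one, the estimated person totals add up to the known population size, which is a convenient sanity check on the inputs.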

A common approach to correct for nonresponse is by
imputation of the missing values in the sample. Based on
the estimated distribution for Y for a given family size and
place of residence for the nonrespondents, $\hat{P}(Y = y \mid x, z, r = 0)$,
we assign the nonrespondents to the values
1, ..., 5 in proportions given by $\hat{P}(Y = y \mid x, z, r = 0)$ for
y = 1, ..., 5. Let $n^*_{xy}(0)$ ($n^*_{xy}(1)$) be the number of
imputed values with family size x and household size y,
for rural (urban) areas and let $m_{xu}(0)$ ($m_{xu}(1)$) be the
number of missing observations for persons in rural (urban)
areas with family size x. Then

$$n^*_{xy}(z) = m_{xu}(z) \, \hat{P}(Y = y \mid x, z, r = 0), \quad z = 0, 1, \tag{4.2}$$

and

$$n^*_{xy} = n^*_{xy}(0) + n^*_{xy}(1)$$

is the total number of imputed values with family size x
and household size y, i.e., $n^*_{xy}$ is the estimated expected
number of households of size y, given family size x and
r = 0.

The following general result holds, showing that with 
population model (3.1), the maximum likelihood estimator 
(4.1) is identical to an imputation-based poststratified 
estimator. 


Theorem. Assume model (3.1) for Y. That is, $P(Y = y \mid x, z) = p_{y,x}$
is independent of z, but otherwise the
$p_{y,x}$ are completely unknown with the only restriction
$\sum_y p_{y,x} = 1$, for all values of x. The response mechanism
is arbitrarily parametrized, i.e., no assumption is made about
$P(R = 1 \mid Y = y, x, z)$. Then the maximum likelihood
estimates for $p_{y,x}$ are given by, for x = 1, ..., K,

$$\hat{p}_{y,x} = \frac{n_{xy} + n^*_{xy}}{m_x + m_{xu}},$$

where $n_{xy}$ is the number of respondents belonging to a
family of size x and household size y, $m_x$ ($m_K$) is the
number of respondents belonging to families of size
x ($\ge K$), and $m_{xu} = m_{xu}(0) + m_{xu}(1)$.

Proof. See Appendix A2.


The theorem implies that the estimator can be written as the
imputation-based poststratified estimator, using family size
as the stratifying variable,

$$\hat{H}^{I}_{y,\text{post}} = \frac{1}{y} \sum_{x=1}^{K} M_x \, \frac{n_{xy} + n^*_{xy}}{m_x + m_{xu}}. \tag{4.3}$$

Assuming ignorable response mechanism and using the
model (3.1), the likelihood function is given by
$\prod_{i=1}^{n_r} P(Y_i = y_i \mid x_i)$. Then the maximum likelihood
estimate $\hat{P}(Y = y \mid x)$ is simply the observed rate among the
respondents with household size y, given family size x.
Thus the maximum likelihood estimator turns out to be
identical to the standard poststratified estimator, with family
size as the stratifying variable,

$$\hat{H}_{y,\text{post}} = \frac{1}{y} \sum_{x=1}^{K} M_x \, \frac{n_{xy}}{m_x}. \tag{4.4}$$

For a general study of poststratification see, for example,
Holt and Smith (1979) and Särndal, Swensson and Wretman
(1992, chapter 7.6).
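The standard poststratified estimator can be evaluated directly from the respondent counts in table 1 and the population counts in table 4. The sketch below is illustrative: the counts are entered as we read them from those tables, and the divisor 5.25 for the open-ended category follows Section 5.

```python
# Respondent counts n_xy from table 1 (rows: family size, columns: household size).
n_xy = [
    [83, 48, 20, 9, 2],
    [9, 177, 37, 4, 3],
    [10, 25, 131, 40, 6],
    [2, 13, 37, 231, 17],
    [1, 4, 4, 17, 181],
]
# Persons in the population by registered family size, from table 4.
M = [793_869, 816_880, 784_581, 1_066_016, 670_528]
divisors = [1, 2, 3, 4, 5.25]


def poststratified_totals(n_xy, M, divisors):
    """Standard poststratification: H_y = (1/y) * sum_x M_x * n_xy / m_x,
    where m_x is the number of respondents with family size x."""
    m_x = [sum(row) for row in n_xy]
    return [
        sum(M[x] * n_xy[x][y] / m_x[x] for x in range(len(M))) / divisors[y]
        for y in range(len(divisors))
    ]


H_post = poststratified_totals(n_xy, M, divisors)
observed_rate_1_given_1 = n_xy[0][0] / sum(n_xy[0])
```

The observed rates used here are the parenthesized values of table 2, for example 83/162 for household size one given family size one.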

To illustrate the effects of nonresponse modeling and
poststratification, we also present estimates based on the
regular expansion estimator, given by

$$\hat{H}_{y,e} = \frac{N}{y} \cdot \frac{n_y}{n_r}, \tag{4.5}$$

and on the imputation-based expansion estimator, given by

$$\hat{H}^{I}_{y,e} = \frac{N}{y} \cdot \frac{n_y + n^*_y}{n}. \tag{4.6}$$

Here, $n_y$ is the number of respondents in households of
size y, $n_r$ is the total number of respondents, and
$n^*_y = \sum_x n^*_{xy}$. The estimator (4.5) does not seek to correct
for nonresponse nor use the family population distribution
as a post-stratifying tool to improve the estimation, while
estimator (4.6) tries to take the response mechanism into
account, but cannot correct for nonrepresentative samples.


4.2 Imputation-based Poststratification with a 
Saturated Model 


We now proceed to an intuitive method of imputation 
that was used to estimate response probabilities for a 
modified Horvitz-Thompson estimator in the official 
statistics from the 1992 CES (described in Belsby 1995). 
We will use this imputation method for the poststratified 
estimator (4.3). 

The imputation method consists of distributing, within
rural/urban area, the $m_{xu}(z)$ nonresponse units over the
household sizes 1, ..., 5 in such a way that, given
household size, the rate of nonresponse is the same for all
family sizes. It implicitly assumes that the response
probability for persons with the same household size within
rural/urban area is identical for different family sizes.
Denote the number of nonresponse persons with family size
x and household size y and place of residence z obtained
in this manner by $h_{xy}(z)$. The corresponding number
among the respondents is $n_{xy}(z)$. The values $h_{xy}(z)$
are determined by the equations

$$\frac{h_{xy}(z)}{h_{xy}(z) + n_{xy}(z)} = \frac{h_{x'y}(z)}{h_{x'y}(z) + n_{x'y}(z)}, \quad x \ne x'; \; y = 1, \ldots, 5; \; z = 0, 1. \tag{4.7}$$

When $n_{xy}(z) = 0$, we let $h_{xy}(z) = 0$. The equation
(4.7) is solved under the conditions

$$\sum_{y} h_{xy}(z) = m_{xu}(z); \quad x = 1, 2, 3, 4, 5 \text{ and } z = 0, 1. \tag{4.8}$$

Solving (4.7) and (4.8) requires, for each value of z, one
row $(n_{x1}(z), n_{x2}(z), \ldots, n_{x5}(z))$ of nonzeros, which
holds for our case. The imputed values $h_{xy}(z)$ determined
by (4.7) and (4.8) correspond to the imputation method
described by (4.2) for the following model:

$$P(Y = y \mid x, z) = p_{y,x,z}, \text{ with no restrictions}, \tag{4.9a}$$

$$P(R = 1 \mid Y = y, x, z) = q_{y,z}, \text{ independent of } x. \tag{4.9b}$$


This can be seen as follows: 

For the ten multinomial trials determined by the different
(x, z)-values, we have 50 unknown cell probabilities
$\pi_{x,y,z} = P(Y = y, R = 1 \mid x, z)$. With no restrictions on cell
probabilities, the maximum likelihood estimates (mle) are
given by observed relative frequencies,

$$\hat{\pi}_{x,y,z} = \frac{n_{xy}(z)}{m_x(z) + m_{xu}(z)}.$$

This also holds when $n_{xy}(z) = 0$. Now, it can be shown
that there is a one-to-one correspondence between
$\pi = (\pi_0, \pi_1)$ and $(p_0, q_0, p_1, q_1)$, where
$\pi_z = (\pi_{x,y,z}: y = 1, \ldots, 5; \; x = 1, \ldots, 5)$,
$p_z = (p_{y,x,z}: y = 1, \ldots, 5; \; x = 1, \ldots, 5)$
and $q_z = (q_{1,z}, \ldots, q_{5,z})$. Since $\pi_{x,y,z} = p_{y,x,z} \, q_{y,z}$, the
mle of $p_{y,x,z}$ and $q_{y,z}$ must satisfy

$$\hat{p}_{y,x,z} \, \hat{q}_{y,z} = \frac{n_{xy}(z)}{m_x(z) + m_{xu}(z)} \tag{4.10}$$

and are uniquely determined by $\hat{\pi}_{x,y,z}$.

Consider $h_{xy}(z)$, given by (4.7) and (4.8). Let
$h_y(z) = \sum_x h_{xy}(z)$ and $n_y(z) = \sum_x n_{xy}(z)$. From (4.7),

$$\frac{h_{xy}(z)}{h_{xy}(z) + n_{xy}(z)} = \frac{h_y(z)}{h_y(z) + n_y(z)} \quad \text{if } n_{xy}(z) > 0. \tag{4.11}$$

From (4.10) and (4.11) we have that the following
intuitive estimates also are mle:

$$\hat{q}_{y,z} = \frac{n_y(z)}{n_y(z) + h_y(z)}, \tag{4.12}$$

$$\hat{p}_{y,x,z} = \frac{n_{xy}(z) + h_{xy}(z)}{m_x(z) + m_{xu}(z)} \tag{4.13}$$

(also when $n_{xy}(z) = h_{xy}(z) = 0$).
(We can also show (4.12) and (4.13) by maximizing the
loglikelihood directly.) Next, we show that the imputed
values (4.2) for the model (4.9) equal $h_{xy}(z)$. From (4.2),
we have $n^*_{xy}(z) = m_{xu}(z) \cdot \hat{P}(Y = y \mid x, z, R = 0)$. Under
model (4.9) and estimates (4.12) and (4.13), we find that

$$\hat{P}(Y = y \mid x, z, R = 0) = \frac{\hat{P}(Y = y \mid x, z) \, \hat{P}(R = 0 \mid Y = y, x, z)}{\hat{P}(R = 0 \mid x, z)} = \frac{\hat{p}_{y,x,z} \, (1 - \hat{q}_{y,z})}{m_{xu}(z) / (m_x(z) + m_{xu}(z))}$$

$$= \frac{n_{xy}(z) + h_{xy}(z)}{m_{xu}(z)} \cdot \frac{h_y(z)}{n_y(z) + h_y(z)} = \frac{h_{xy}(z)}{m_{xu}(z)},$$

and it follows that $n^*_{xy}(z) = h_{xy}(z)$. If $n_{xy}(z) = 0$, then
$\hat{p}_{y,x,z} = h_{xy}(z) = 0$, and $n^*_{xy}(z) = 0$. We note that model
(4.9) is saturated and will, from (4.10), give perfect fit.
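Equations (4.7) and (4.8) reduce to a small linear system: (4.7) implies $h_{xy}(z) = a_y(z) \, n_{xy}(z)$ for a nonresponse-to-response ratio $a_y(z)$ common to all family sizes, and (4.8) then reads $\sum_y a_y(z) \, n_{xy}(z) = m_{xu}(z)$. The sketch below solves that system with plain Gaussian elimination. It is an illustration under a loudly stated simplification: the rural and urban counts of Appendix A1 are not reproduced here, so the pooled counts of table 1 are used as if there were a single area.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def impute_saturated(n_xy, m_xu):
    """Return imputed counts h[x][y] and response probabilities q[y]."""
    ratios = solve_linear(n_xy, m_xu)          # a_y = h_xy / n_xy for every x
    h = [[count * a for count, a in zip(row, ratios)] for row in n_xy]
    q = [1.0 / (1.0 + a) for a in ratios]      # q_y = n_y / (n_y + h_y)
    return h, q


# Pooled table 1 counts (simplification: rural and urban are not separated).
n_xy = [
    [83, 48, 20, 9, 2],
    [9, 177, 37, 4, 3],
    [10, 25, 131, 40, 6],
    [2, 13, 37, 231, 17],
    [1, 4, 4, 17, 181],
]
m_xu = [153, 160, 91, 123, 60]
h, q = impute_saturated(n_xy, m_xu)
```

In the paper the same calculation is carried out separately for z = 0 and z = 1; the pooled version here only shows the mechanics.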

The imputation-based expansion estimates (4.6), with
model (4.9), are identical to the modified Horvitz-Thompson
estimates with $\hat{q}_{y,z} = n_y(z)/[n_y(z) + n^*_y(z)]$
(from (4.12)) as the estimated response probabilities, used in
the official statistics from the 1992 CES. This follows from
the fact that the modified Horvitz-Thompson estimator of
$N_y$ is given by

$$\hat{N}_{y,HT} = \sum_{i \in s_r} \frac{I(Y_i = y)}{\hat{\pi}_i},$$

where $\pi_i$ = P(person i is selected to the sample and
responds). Hence,

$$\hat{\pi}_i = \frac{n}{N} \hat{P}(R_i = 1 \mid x_i, z_i, Y_i = y) = \frac{n}{N} \hat{q}_{y,z_i},$$

and

$$\hat{N}_{y,HT} = \frac{N}{n} \left( \frac{n_y(0)}{\hat{q}_{y,0}} + \frac{n_y(1)}{\hat{q}_{y,1}} \right). \tag{4.14}$$

Here,

$$\hat{N}_{y,HT} = \frac{N}{n} \left( \frac{n_y(0)}{n_y(0)/(n_y(0) + n^*_y(0))} + \frac{n_y(1)}{n_y(1)/(n_y(1) + n^*_y(1))} \right) = N \, \frac{n_y + n^*_y}{n}.$$

So this modified Horvitz-Thompson estimator suffers
from the same negative feature as the imputation-based
expansion estimator (4.6); it cannot correct for the bias in an
unrepresentative sample. For a general description of the
modified Horvitz-Thompson method see, e.g., Särndal et al.
(1992, chapter 15).
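The algebraic identity between the modified Horvitz-Thompson estimator (4.14) and the imputation-based expansion of the person total can be checked numerically. The counts in this sketch are hypothetical and serve only to exercise the formulas.

```python
def modified_horvitz_thompson(N, n, n_y, n_star_y):
    """(4.14): weight respondents in each area z by 1 / q_hat(y, z),
    with q_hat(y, z) = n_y(z) / (n_y(z) + n*_y(z))."""
    total = 0.0
    for z in n_y:
        q_hat = n_y[z] / (n_y[z] + n_star_y[z])
        total += n_y[z] / q_hat
    return N / n * total


def imputed_expansion(N, n, n_y, n_star_y):
    """Person total N * (n_y + n*_y) / n, i.e. y times estimator (4.6)."""
    return N * (sum(n_y.values()) + sum(n_star_y.values())) / n


# Hypothetical counts for one household size y; keys are z = 0 (rural), 1 (urban).
N, n = 4_131_874, 1698
n_y = {0: 60, 1: 45}
n_star_y = {0: 38.5, 1: 71.25}

ht_total = modified_horvitz_thompson(N, n, n_y, n_star_y)
expansion_total = imputed_expansion(N, n, n_y, n_star_y)
```

Both functions return the same person total, so neither estimator uses the known family size distribution of the population.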


4.3 Variance Estimation 


Variance estimates for the various estimators are obtained
by bootstrapping. It can be carried out under the modeling
or quasi-randomization framework (Little and Rubin 1987).
For instance, to estimate the variance under model (3.1) and
RM1 (3.2), we may apply the parametric bootstrap with the
estimated parameters (Efron and Tibshirani 1993). However,
it is not clear how to compare the variances estimated
under the alternative models. We have therefore chosen to
estimate the variances of the different estimators under a
common quasi-randomization framework. We assume
simple random sampling conditional on the family size,
which is the only assumption we make for variance
estimation. Unconditionally we have a self-weighting, but
not simple random, sample, and therefore this is a rather
crude approximation to the actual conditional sampling
design. However, for a comparative study of the estimators
the approximation will serve its purpose well. The
nonresponse indicator $r_i$ is considered to be a constant
associated with person i. We draw the bootstrap sample,
resampling $(y_i, z_i, r_i = 1)$, $(z_i, r_i = 0)$ randomly with
replacement, as described by Shao and Sitter (1996, Section
5), within each post-stratum $\{i: x_i = x\}$. While the sizes
of the sample post-strata are fixed, both the number of
nonrespondents and the number of persons from urban or
rural areas vary from one bootstrap sample to another. We
calculate the bootstrap estimates in the same way as based
on the observed data. In particular, the bootstrap data are
imputed in the same way as the original data if the estimator
is imputation-based. Finally, the estimated variances and
standard errors are obtained by the usual Monte Carlo
approximation based on 500 independent bootstrap samples.
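The resampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout (a dictionary from family size to a list of `(y, z, r)` tuples, with `y = None` for nonrespondents), the toy data and the example estimator are all assumptions made for the sketch.

```python
import random
import statistics


def bootstrap_variance(strata, estimator, n_boot=500, seed=0):
    """Resample units with replacement within each post-stratum (family size),
    keeping stratum sizes fixed, and return the variance of the replicates."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resample = {
            x: [rng.choice(units) for _ in range(len(units))]
            for x, units in strata.items()
        }
        replicates.append(estimator(resample))
    return statistics.variance(replicates)


def response_rate(sample):
    """Example estimator: overall response rate in the (re)sample."""
    units = [u for stratum in sample.values() for u in stratum]
    return sum(r for (_, _, r) in units) / len(units)


# Toy post-strata keyed by family size; units are (y, z, r).
toy_strata = {
    1: [(1, 0, 1), (2, 1, 1), (None, 1, 0), (None, 0, 0)],
    2: [(2, 0, 1), (2, 1, 1), (3, 0, 1), (None, 1, 0)],
}
var_rate = bootstrap_variance(toy_strata, response_rate)
```

Any of the estimators of Section 4 can be passed as `estimator`; for the imputation-based ones the imputation step has to be repeated inside the function for every replicate, as the text prescribes.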


5. Estimated Number of Households of Different 
Sizes Based on the 1992 Norwegian Consumer 
Expenditure Survey 


In this section we present the estimated number of 
households of sizes one to five and more, and the total 
number of households for the population in Norway aged 
less than eighty years old. The estimation uses the data from 
CES 1992, and is based on the estimators considered in 
Section 4. To compute the estimates we need the number of
families of different sizes in the population, i.e., $M_x$, at the
time of the 1992 survey. The actual number at the time of
the survey is not recorded. As an approximation we use the
numbers at January 1, 1993. These are given in table 4.

Table 4
Families and Persons with Age Less than 80 Years
in Norway at January 1, 1993

Number of persons in family      Families        Persons
1 person                          793,869        793,869
2 persons                         408,440        816,880
3 persons                         261,527        784,581
4 persons                         266,504      1,066,016
5 or more persons                 127,653        670,528
Total                           1,857,993      4,131,874

Note that the average family size for families with 5 or
more persons is 670,528/127,653 = 5.25. We use 5.25 as an
estimate of the average household size for households of
size 5 or more, and divide by 5.25 instead of 5 in all
estimates of $H_5$.


5.1 Maximum Likelihood Estimation and 
Poststratification 


The estimated household distributions are presented in
table 5. The estimates are based on the maximum likelihood
(m.l.) estimator (4.1) using the population model with the
restricted parametric link function $p_{y,x}$ in combination with
the response models RM1(y, z) and RM2(y, z). To
illustrate the effect of nonresponse modeling versus
poststratification we also present the standard poststratified
estimator (4.4). We recall that this is the maximum
likelihood estimator when ignoring the response mechanism.
Furthermore, we present the estimated household size
distribution based on the imputation-based poststratification
(4.3) with the saturated model (4.9). For assessing the
sampling variability of the different estimators, the
estimated standard errors are also included.

The three models that take the response mechanism into
account give a higher total number of households. They also
give considerably higher numbers of one-person households.
This seems sensible since we expect the one-person
households to have the highest nonresponse rate, and thus
these estimates are most influenced by taking the response
mechanism into account. We note that the restricted
parametric link model (3.1) together with the logistic response
model RM2(y, z) gives practically the same poststratified
estimates as model (4.9), with approximately the same
standard errors as well. Because of the freedom of model (4.9), with
perfect fit, it seems that model (3.1) & RM2(y, z) works
well for estimating the number of households of different
sizes. Regarding the uncertainty of the estimates, we see, as
one might expect, that the standard errors typically seem to
increase with the number of unknown parameters in the
underlying model. Also, the total number of households is
rather accurately estimated, not counting possible bias,
while it is clearly most difficult to estimate the number of
one-person households.

In order to evaluate the extent to which the differences
between the estimates are due to sampling error or
nonresponse bias, we consider the estimated standard errors of
the differences of the point estimates. Some of these are
given in table 6, using mostly the imputation-based
poststratification with the saturated model as a reference. For
short, we use the terms Est1 to Est4 for the estimates defined
as they appear in table 5:

Est1: M.l. estimator based on population model $p_{y,x}$
and response model RM1

Est2: M.l. estimator based on population model $p_{y,x}$
and response model RM2

Est3: Imputation-based poststratification based on the
saturated model (4.9)

Est4: Poststratified estimator without imputation.


Based on tables 5 and 6 we can conclude that Est4 and
Est3 have different expected values in estimating $H_1$,
$H_3$, $H_5$ and $H$. Regarding the other comparisons, we see
that in estimating $H_3$ there is a significant difference
between Est1 and Est2/Est3, and note from earlier discussions
in Section 3.3 that RM2 gives a better fit to the data than RM1.
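
The comparisons above can be checked numerically. The following Python sketch divides each difference of point estimates in table 5 by the corresponding standard error in table 6 and flags ratios larger than 1.96 in absolute value. The 1.96 cutoff (a two-sided 5% normal test) and the names `z_ratios` and `flagged` are assumptions made for illustration; the paper does not state which significance level it uses.

```python
# Point estimates from table 5 for household sizes 1..5 and the total.
est = {
    "Est1": [558800, 520200, 278900, 258900, 125800, 1742600],
    "Est2": [595400, 525800, 249100, 269000, 126000, 1765300],
    "Est3": [596600, 523600, 250000, 268900, 126200, 1765300],
    "Est4": [486000, 507800, 286200, 270600, 131300, 1681900],
}
# Standard errors of the differences from table 6.
se_diff = {
    ("Est1", "Est2"): [29700, 19300, 15400, 6700, 1700, 15300],
    ("Est1", "Est3"): [37000, 22200, 15200, 6500, 1700, 18800],
    ("Est2", "Est3"): [16600, 8800, 5300, 1800, 500, 8900],
    ("Est4", "Est3"): [42400, 23100, 15500, 6600, 1900, 23300],
}
labels = ["H1", "H2", "H3", "H4", "H5", "H"]


def z_ratios(a, b):
    """Difference of two estimates divided by the standard error of the difference."""
    return {
        lab: (x - y) / s
        for lab, x, y, s in zip(labels, est[a], est[b], se_diff[(a, b)])
    }


def flagged(a, b, threshold=1.96):
    """Labels whose ratio exceeds the (assumed) threshold in absolute value."""
    return [lab for lab, z in z_ratios(a, b).items() if abs(z) > threshold]


for pair in se_diff:
    print(pair, {lab: round(z, 2) for lab, z in z_ratios(*pair).items()})
```

Under this assumed cutoff, the Est4 versus Est3 comparison flags H1, H3, H5 and H, while the largest Est1 versus Est2 ratio, about 1.9, is the one for H3.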

The estimates based on the expansion estimator of $H_y$
given by (4.5), in 100's, are 390,500, 496,500, 283,900,
279,900, 148,000 and 1,598,800, with estimated standard
errors equal to 33,100, 21,700, 14,600, 11,600, 6,100 and
23,700 for $H_1, \ldots, H_5$ and $H$, respectively. The
standard errors for the differences between these estimates
and the Est3-estimates are 52,800, 30,900, 19,100, 10,800,
5,400 and 32,000 for $H_1, \ldots, H_5$ and $H$, respectively.
These expansion estimates indicate serious bias due to
nonresponse, especially the estimates for $H_1$, $H_5$ and $H$,
with poststratification correcting for some of the bias
(probably about 50% for the estimates of $H_1$ and $H$). We
also note that the standard errors for the poststratified
estimator and this simple expansion estimator are about the
same. So by reducing the bias with poststratification one
reduces the total error as well.
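
As a rough check of the "about 50%" figure, the sketch below computes the share of the gap between the simple expansion estimate and a reference estimate that is closed by poststratification without imputation (Est4). Treating Est3 as the reference is an assumption made here for illustration, as is the function name `share_corrected`; the true totals are of course unknown.

```python
# Simple expansion estimates (4.5), Est4 and Est3 from table 5,
# for one-person households (H1) and the total (H).
expansion = {"H1": 390500, "H": 1598800}
poststratified = {"H1": 486000, "H": 1681900}  # Est4
reference = {"H1": 596600, "H": 1765300}       # Est3, assumed reference


def share_corrected(label):
    """Fraction of the expansion-to-reference gap closed by poststratification."""
    gap = reference[label] - expansion[label]
    return (poststratified[label] - expansion[label]) / gap


for label in ("H1", "H"):
    print(label, round(share_corrected(label), 3))
```

With these inputs the shares come out a little under one half for both H1 and H.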

Poststratification corrects for the bias caused by the 
discrepancy between the family size distributions in the 
response sample and the population. From table 1 and table
4 we see that these family size distributions are given by (in 
percentages), for x=1, ..., 5: 


Response sample: 14.6 — 20.7 — 19.1 — 27.0 — 18.6 
Population: 19.2 — 19.8 — 19.0 — 25.8 — 16.2. 


Since the number of one-person families is much too low
in the response sample, the expansion estimate of $H_1$
will be too low as well. With poststrata determined by family
size, poststratification corrects for the family size bias in
the response sample, but implicitly assumes that
nonrespondents and respondents have the same household size
distribution for a fixed family size. In other words, the
respondents are treated as a random subsample of the sampled
units with the same family size, as mentioned by Little
(1993). This is most likely not the case. We recall that the
family size variable was not significant when the household
variable was included in the response models. Thus it seems
reasonable to assume, as in our response models, that
response rates will vary with the actual household sizes
rather than the registered family sizes. Typically, estimates
of the number of one-person households will be biased
when the nonrespondents are ignored.
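
To make the family-size correction concrete, the sketch below computes, for each family size, the ratio of the population share to the response-sample share quoted above. This ratio is the factor by which poststratification reweights a family-size stratum relative to simple expansion. The variable names are illustrative only.

```python
# Family size distributions (percentages) for x = 1, ..., 5, as quoted in the text.
response_sample = [14.6, 20.7, 19.1, 27.0, 18.6]
population = [19.2, 19.8, 19.0, 25.8, 16.2]

# Poststratification scales stratum x by (population share) / (sample share).
factors = [p / s for p, s in zip(population, response_sample)]

for x, f in enumerate(factors, start=1):
    print(f"family size {x}: factor {f:.3f}")
```

One-person families are scaled up by roughly a third, while families of size five or more are scaled down. The factor does nothing, however, about differences in household size between respondents and nonrespondents within a family-size stratum.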


Table 5
Estimated Household Totals for Persons Aged Less than 80 Years in Norway at January 1, 1993, in Units of 100.
In Parentheses, the Estimated Standard Error of the Estimates

                Maximum likelihood estimator with                 Imputation-based         Ignoring the
                nonignorable response mechanism                   poststratification       response mechanism

Household       Population model          Population model        Saturated population     Poststratified
size, y         $p_{y,x}$ and             $p_{y,x}$ and           and response model       estimator
                response model            response model
                RM1(y, z)          %      RM2(y, z)          %                       %                      %

1                 558,800         32        595,400                 596,600         34       486,000       29
                 (38,900)                  (48,000)                (53,500)                 (35,800)
2                 520,200         30        525,800                 523,600         30       507,800       30
                 (20,600)                  (27,400)                (29,800)                 (20,000)
3                 278,900         16        249,100                 250,000         14       286,200       17
                 (13,800)                  (20,300)                (19,800)                 (14,100)
4                 258,900         15        269,000                 268,900         15       270,600       16
                  (9,800)                  (11,600)                (11,500)                 (10,100)
≥ 5               125,800          7        126,000                 126,200          7       131,300        8
                  (4,700)                   (5,100)                 (5,000)                  (4,700)
Total           1,742,600        100      1,765,300        100    1,765,300        100     1,681,900      100
                 (25,600)                  (29,700)                (31,900)                 (23,300)
Table 6
Estimated Standard Errors of the Differences of the Point Estimates in Table 5

Household size   Est1-Est2   Est1-Est3   Est2-Est3   Est4-Est3
1                  29,700      37,000      16,600      42,400
2                  19,300      22,200       8,800      23,100
3                  15,400      15,200       5,300      15,500
4                   6,700       6,500       1,800       6,600
≥ 5                 1,700       1,700         500       1,900
Total              15,300      18,800       8,900      23,300


After having corrected for nonresponse bias by completing
the sample with imputed values, the sample itself
may be skewed compared to the population. To illustrate the
effect of poststratification in correcting for this, we shall
compare, using the saturated model (4.9), the
imputation-based poststratified estimates Est3 with the
imputation-based expansion estimates given by (4.6): 583,900,
567,700, 244,300, 259,300, 122,400 and 1,777,600 for
$H_1, \ldots, H_5$ and $H$, respectively. As noted in Section
4.2, see (4.14), these estimates are identical to the modified
Horvitz-Thompson estimates. The standard errors for these
estimates are practically the same as for Est3. Hence, the
alternative poststratified estimation methods based on
nonignorable response models have standard errors no worse
than those of the modified Horvitz-Thompson estimator. So if
one reduces the bias with the alternative methods, one reduces
the total error too. The standard errors of the differences
between Est3 and this modified Horvitz-Thompson estimator in
the estimates of $H_1, \ldots, H_5$ and $H$ are 3,500,
2,200, 1,100, 600, 200 and 2,100, respectively. Clearly these
two methods give significantly different estimates for all
household size totals. In this comparison, one feature stands
out. The expansion estimate of the number of two-person
households, 567,700, is clearly too high, as seen by
comparing the family size distributions in the total sample
and the population (in percentages), for x = 1, ..., 5:


Population:   19.2   19.8   19.0   25.8   16.2
Sample:       18.6   23.0   17.8   24.9   15.7


The sample proportion of persons in two-person families
is much too high, and even though we have corrected
for nonresponse bias, the expansion estimator, and hence also
the modified Horvitz-Thompson estimator, cannot correct
for a nonrepresentative sample. This will necessarily lead to
biased estimates of $H_2$. We need poststratification to
correct for a skewed sample. One can regard the difference
in expected values for these estimators of $H_2$ as being
close to the bias of the modified Horvitz-Thompson
estimator, and note that an approximate 95% confidence
interval for this difference is (39,800, 48,400).
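
The quoted interval can be reproduced from numbers given in this section. The sketch below assumes a normal approximation with the usual 1.96 multiplier (an assumption, since the text only says "approximate"), and uses the two-person estimates 567,700 (imputation-based expansion) and 523,600 (Est3) together with the standard error 2,200 of their difference.

```python
# Two-person household estimates quoted in the text and in table 5.
expansion_h2 = 567700   # imputation-based expansion, (4.6)
est3_h2 = 523600        # imputation-based poststratification, Est3
se_difference = 2200    # standard error of the difference, as quoted


def normal_ci(diff, se, z=1.96):
    """Approximate confidence interval: diff plus or minus z standard errors."""
    return diff - z * se, diff + z * se


lower, upper = normal_ci(expansion_h2 - est3_h2, se_difference)
print(round(lower, -2), round(upper, -2))
```

Rounded to the nearest hundred, the two endpoints agree with the interval stated above.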

For robustness considerations we also present the
estimates from the cumulative logit model mentioned in
Section 3.1 together with RM1(y, z), which we know fits the
data poorly. They are, in 100's: 591,800, 501,000, 265,200,
267,300, 128,200 and 1,753,500 for $H_1, \ldots, H_5$ and $H$,
respectively. Compared to table 5, this seems to indicate that
a reasonable model for response plays a more important role
than a good population model. It is also evident that
nonresponse modeling makes a difference, as seen when
compared to poststratification and simple expansion.


5.2 Comparison with the Currently Used Estimates 
in CES, the Quality Survey for the 1990 Census 
and a Projection Study 


Since 1993, an alternative, computationally simpler,
modified Horvitz-Thompson estimator of type (4.14) has
been in use in the production of official statistics from CES;
see Belsby (1995). We recall from Section 2 that the
weights are the inverse of the product of the household's
sampling probability and its estimated probability of
response. The response probabilities are estimated using a
logistic model similar to RM2(y, z) with place of residence
and household size as explanatory variables. For the
nonrespondents with unknown household size the registered
family size is used instead, replacing (3.5). Thus, the
weights may be regarded as an approximation to using (3.5).
Of course, (3.5) is possible only when a population model is
considered, which CES has not done. Table 7 presents the
estimated household distribution based on this
CES-modified Horvitz-Thompson estimator.
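
The weighting just described can be sketched in a few lines of Python. This is a generic illustration, not the CES production code: it assumes that each responding household gets the weight 1 / (sampling probability × estimated response probability), and the function names and the toy inputs are hypothetical.

```python
def modified_ht_weights(sampling_probs, response_probs):
    """Weight for each respondent: inverse of sampling prob. times response prob."""
    return [1.0 / (pi * p) for pi, p in zip(sampling_probs, response_probs)]


def modified_ht_total(indicators, sampling_probs, response_probs):
    """Weighted total of a 0/1 indicator (e.g., 'is a one-person household')."""
    weights = modified_ht_weights(sampling_probs, response_probs)
    return sum(w * y for w, y in zip(weights, indicators))


# Hypothetical respondents: indicator, sampling probability, estimated response probability.
indicators = [1, 0, 1]
sampling_probs = [0.01, 0.01, 0.02]
response_probs = [0.5, 0.8, 0.5]
print(modified_ht_total(indicators, sampling_probs, response_probs))
```

Because every weight depends only on the unit's own probabilities, such an estimator has no mechanism for pulling a skewed family-size distribution in the sample back to the known population distribution.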

The quality survey for the 1990 Census, PES 1990,
contains 8,280 respondents and uses practically the same
household definition as CES. The response rate was 95%.
The $H_y$-estimates use poststratification with respect to
household size in the Census. However, no attempts were
made to correct for possible nonresponse bias with respect
to actual household size. PES deals with the whole
population. Table 7 has the estimates for the 0-79 age
group with the same poststratification method as in PES.

Table 7 also presents estimates based on the Household 
Projections study by Keilman and Brunborg (1995). This 
study simulates household structure for the period 1990 to 
2020. The data sources are 28,384 individuals from the 
1990 Population and Housing Census and 1988 Family and 
Occupation Survey. Keilman and Brunborg project for the 
whole population in 1992. We adjust their estimates to the 
0-79 age group.
Table 7
Estimated Household Size Totals for Persons Less than 80 Years in Norway at January 1, 1993
with CES-modified Horvitz-Thompson, PES 1990 and Projections, in Units of 100

Household size   CES-modified                %       PES 1990     %    Projections     %
                 Horvitz-Thompson
1                    622,900                35        626,000    35       668,300     37
2                    518,500                29        494,200    28       549,000     30
3                    259,900                15        291,500    16       211,900     12
4                    258,500       [illegible]        250,000    14       221,500     12
≥ 5                  124,600                 7        115,300     6        97,500      5
Unknown                                                                    78,500      4
Total              1,784,400       [illegible]      1,777,000    99     1,826,700    100
Table 8
Estimated Probability of Response Based on the Method Used
in CES Since 1993, in Percentages

                                  Household size
Place of residence       1        2        3        4    5 or more

CES-method
  Rural                44.53    66.24    74.55    73.54    80.07
  Urban                36.01    57.90    67.25    66.09    73.80
Model $p_{y,x}$ in (3.1) combined with RM2(y, z)
  Rural                47.77    60.90    79.05    73.26    81.52
  Urban                38.92    52.04    72.44    65.62    75.46


The estimates in table 7 support our impression that the
estimates based on modeling the response mechanism are
less biased than those that ignore the response mechanism,
as in mere poststratification or simple expansion.
This is especially true for the one-person households and the
total. The current "official estimator", the modified
Horvitz-Thompson estimator, seems to give estimates of the
right magnitude and is in fact closer to the results of PES
1990 than the model-based estimates. However, this is more by
accident. As a method it has some problems even in a
representative sample. We can study this by estimating the
response probabilities. Table 8 presents the results together
with the estimates based on RM2(y, z) & (3.1) from table 3.

Compared to the estimated response probabilities based
on model RM2(y, z) with (3.1), we see that replacing
household size with family size in the nonresponse group is
not a satisfactory approximation. Hence, if compared with
the modified Horvitz-Thompson estimator in Section 5.1
based on the saturated model (4.9), the latter would be
preferred. For this particular survey, the CES approach
overestimates the probability of response for households of
size 2, which in a representative sample would lead to
underestimation of $H_2$. The estimated response
probabilities will most likely be biased when we use family
size in place of household size in the nonresponse group
when estimating the parameters in the response model. This
bias comes in addition to the previously mentioned problem,
that the modified Horvitz-Thompson estimates will be
similar to the imputation-based expansion estimates and
cannot correct for nonrepresentative samples (as has been a
problem in CES since 1993). In the 1992 CES, however, the
sample is skewed with too high a proportion of families of
size 2, and the $H_2$-estimate will be of the right
magnitude, by accident.


6. Conclusions 


We have investigated modeling and methodological
issues for estimating the total number of households of
different sizes in Norway, based on the Norwegian
Consumer Expenditure Survey (CES). The main issue is
how to correct for bias due to nonignorable nonresponse.
The existing estimation method in CES is a modified
Horvitz-Thompson estimator that includes a correction for
nonresponse by estimating response probabilities. We have
considered basically two model-based approaches, a
maximum-likelihood estimator and imputation-based
poststratification according to registered family size. With
a population model that corresponds to a group model
according to family size only, these two estimators are
identical. This family group model for household size and a
logistic link for the response probability using household
size as a categorical variable seem to work well for our
estimation problem.

In analyzing the 1992 CES, we find serious bias due to
nonresponse, especially in the estimates for $H_1$ and $H$,
with pure poststratification (without imputation) correcting
for some of the bias (probably about 50% for the estimates of
$H_1$ and $H$). Poststratification does not, however, take
into account possible nonresponse bias dependent on
household size. Our response models assume that the
response rates vary with the actual household sizes
rather than the registered family sizes, and it is quite
evident that such nonresponse modeling makes a difference,
leading to less biased estimates than mere poststratification
or simple expansion, especially of $H_1$ and $H$.

The modified Horvitz-Thompson estimates used in the
official statistics from CES correspond to imputation-based
expansion estimates. Hence, they cannot correct for
nonrepresentative samples. The study in this paper shows
that, in addition to a nonignorable response model, it is also
necessary to poststratify according to family size, i.e., to
use a population model given family size. Hence
poststratification, response modeling and imputation are key
ingredients for a satisfactory approach.

In any estimation problem of totals in survey sampling, 
one must be aware of the fact that a Horvitz-Thompson 
estimator cannot correct for skewed samples, even when 
modified with good response estimates. Poststratification 
should always be considered as well as imputation based on 
a response model, nonignorable when needed. 


Appendix A1


The data for rural and urban areas separately are given in 
table A1. 


Appendix A2 
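Theorem. Assume model (3.1) for $Y$, i.e.,
$P(Y = y \mid x, z) = p_{y,x}$ is independent of $z$, but otherwise
the $p_{y,x}$'s are completely unknown, with the only restriction
being that $\sum_y p_{y,x} = 1$ for all values of $x$. The
response mechanism is arbitrarily parametrized, i.e., no
assumption is made about $P(R = 1 \mid Y = y, x, z)$. Then the
maximum likelihood estimates for $p_{y,x}$ are given by

$$\hat{p}_{y,x} = \frac{n_{yx} + n^*_{yx}}{m_x + m_{xu}}.$$

Proof. Let $q_{y,x,z} = P(R = 1 \mid Y = y, x, z)$. The log
likelihood is given by

$$l = \sum_x \sum_y n_{yx} \log p_{y,x} + \sum_x \sum_y \sum_z n_{yx}(z) \log q_{y,x,z} + \sum_x \sum_z m_{xu}(z) \log P(R = 0 \mid x, z)$$

$$= \sum_x \sum_y n_{yx} \log p_{y,x} + \sum_x \sum_y \sum_z n_{yx}(z) \log q_{y,x,z} + \sum_x \sum_z m_{xu}(z) \log \Big( \sum_y p_{y,x} (1 - q_{y,x,z}) \Big).$$

We use the Lagrange method and maximize
$G = l + \sum_x \lambda_x \big( \sum_y p_{y,x} - 1 \big)$.

Let the solutions be $p_{y,x}(\lambda_x)$, and determine the $\lambda_x$'s
such that $\sum_y p_{y,x}(\lambda_x) = 1$, for all $x$. No matter how the
$q_{y,x,z}$'s are parametrized, the mle $\hat{p}_{y,x}$ must satisfy, by
solving the equations $\partial G / \partial p_{y,x} = 0$,

$$\frac{n_{yx}}{p_{y,x}} + \sum_{z=0}^{1} m_{xu}(z) \frac{P(R = 0 \mid Y = y, x, z)}{P(R = 0 \mid x, z)} + \lambda_x = 0 \qquad (A1)$$

which is equivalent to:

$$n_{yx} + p_{y,x} \sum_{z=0}^{1} m_{xu}(z) \frac{P(R = 0 \mid Y = y, x, z)}{P(R = 0 \mid x, z)} + \lambda_x p_{y,x} = 0.$$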


Table A1
Family and Household Sizes for the 1992 Norwegian Consumer Expenditure Survey, Split into Rural
and Urban Areas. The Upper Entry is for the Urban Group

                           Household size               Total      Non-               Response
Family size          1     2     3     4   ≥ 5        response   response    Total      rate
1       urban       28    24     7     2     0            61         78        139      0.439
        rural       55    24    13     7     2           101         75        176      0.574
2       urban        6    70    12     3     0            91         84        175      0.520
        rural        3   107    25     1     3           139         76        215      0.647
3       urban        4     8    57    11     3            83         40        123      0.675
        rural        6    17    74    29     3           129         51        180      0.717
4       urban        0     3    15    80     5           103         43        146      0.705
        rural        2    10    22   151    12           197         80        277      0.711
≥ 5     urban        0     1     0     6    66            73         28        101      0.723
        rural        1     3     4    11   115           134         32        166      0.807
Total   urban       38   106    91   102    74           411        273        684      0.601
Total   rural       67   161   138   199   135           700        314      1,014      0.690
We determine $\lambda_x$ by summing over $y$:

$$\sum_y n_{yx} + \sum_z m_{xu}(z) \frac{\sum_y p_{y,x} P(R = 0 \mid Y = y, x, z)}{P(R = 0 \mid x, z)} + \lambda_x = m_x + \sum_z m_{xu}(z) \frac{P(R = 0 \mid x, z)}{P(R = 0 \mid x, z)} + \lambda_x = 0,$$

hence

$$\lambda_x = -\sum_z m_{xu}(z) - m_x = -(m_x + m_{xu}).$$

It follows from (A1) that $p_{y,x}$ satisfies the following
relation:

$$p_{y,x} = \frac{n_{yx}}{m_x + m_{xu} - \sum_z m_{xu}(z) \dfrac{P(R = 0 \mid Y = y, x, z)}{P(R = 0 \mid x, z)}}. \qquad (A2)$$

The imputed values are given by, from (4.2),

$$n^*_{yx}(z) = m_{xu}(z) \, p_{y,x} \frac{P(R = 0 \mid Y = y, x, z)}{P(R = 0 \mid x, z)}$$

and, from (A2), with $n^*_{yx} = \sum_z n^*_{yx}(z)$,

$$p_{y,x} = \frac{n_{yx}}{m_x + m_{xu} - n^*_{yx} / p_{y,x}},$$

or equivalently,

$$p_{y,x} (m_x + m_{xu}) - n^*_{yx} = n_{yx}, \quad \text{i.e.,} \quad p_{y,x} = \frac{n_{yx} + n^*_{yx}}{m_x + m_{xu}}.$$
Appendix A3 


Table A2
The Completed Sample Including the Imputed Values, Split Into Two Groups, Rural and Urban. The Upper Entry
is for the Urban Group and the Lower Entry is for the Rural Group. Based on Model (3.1) and RM1(y, z)

                              Household size
Family size          1        2        3        4          ≥ 5    Total
1       urban      77.8     44.1     12.9      3.9         0.3      139
        rural     103.6     43.1     18.4      8.7         2.2      176
2       urban      10.8    137.9     22.1      3.8         0.4      175
        rural       7.5    168.6     33.9      1.7         3.3      215
3       urban       7.4     14.3     81.3     16.4         3.6      123
        rural      10.7     25.5    104.8     35.6 [illegible]      180
4       urban       0.8      6.4     21.9    110.3         6.6      146
        rural       3.5     16.7     35.1    206.9        14.8      277
≥ 5     urban       0.5      2.4      1.0      9.0        88.2      101
        rural       1.6      4.7      5.2     14.4       140.1      166
Total   urban      97.4    205.1    139.2    143.4        99.1      684
        rural     126.9    258.4    197.4    267.3       164.2    1,014
Table A3
The Completed Sample Including the Imputed Values, Split Into Two Groups, Rural and Urban. The Upper Entry
is for the Urban Group and the Lower Entry is for the Rural Group. Based on Model (3.1) and RM2(y, z)

                              Household size
Family size          1        2        3        4      ≥ 5    Total
1       urban      81.6     42.7     10.4      4.0      0.3      139
        rural     107.5     41.5     15.9      8.8      2.3      176
2       urban      11.9    140.4     18.3      3.9      0.5      175
        rural       8.6    170.9     30.3      1.8      3.4      215
3       urban       9.4     16.1     75.2     18.6      3.7      123
        rural      13.4     27.7     96.5     38.5      3.9      180
4       urban       0.8      6.2     18.9    113.5      6.6      146
        rural       3.7     16.2     29.2    213.1     14.8      277
≥ 5     urban       0.5      2.3      0.6      9.3     88.3      101
        rural       1.7      4.6      4.6     14.9    140.2      166
Total   urban     104.2    207.7    123.4    149.3     99.4      684
        rural     134.9    260.9    176.5    277.1    164.6    1,014
Appendix A4


Table A4
The Completed Sample Including the Imputed Values, Split Into Two Groups, Rural and Urban.
The Upper Entry is for the Urban Group and the Lower Entry is for the Rural Group.
Based on Model (4.9), i.e., Imputations Determined by (4.7) and (4.8)

                              Household size
Family size          1        2        3        4      ≥ 5    Total
1       urban      79.6     47.2      9.4      2.8      0.0      139
        rural     108.3     38.5     16.9      9.9      2.4      176
2       urban      17.1    137.7     16.0      4.2      0.0      175
        rural       5.9    171.6     32.5      1.4      3.6      215
3       urban      11.4     15.7     76.2     15.6      4.1      123
        rural      11.8     27.3     96.2     41.1      3.6      180
4       urban       0.0      5.9     20.0    113.2      6.9      146
        rural       3.9     16.0     28.6    214.0     14.5      277
≥ 5     urban       0.0      2.0      0.0      8.5     90.5      101
        rural       2.0      4.8      5.2     15.6    138.4      166
Total   urban     108.1    208.5    121.6    144.3    101.5      684
        rural     131.9    258.2    179.4    282.0    162.5    1,014
Table A5
The Total Numbers of Family and Household Sizes for Imputed Complete Sample. Based on Model (4.9)

                       Household size
Family size      1        2        3        4      ≥ 5    Total
1              187.9     85.7     26.3     12.7      2.4      315
2               23.0    309.2     48.6      5.6      3.6      390
3               23.2     43.0    172.4     56.7      7.7      303
4                3.9     21.9     48.7    327.2     21.4      423
≥ 5              2.0      6.8      5.2     24.1    229.0      267
Total          240.0    466.6    301.1    426.3    264.0    1,698
Bayesian Analysis of Nonignorable Missing Categorical Data:
An Application to Bone Mineral Density and Family Income 


Balgobin Nandram, Lawrence H. Cox and Jai Won Choi ¹


Abstract 


We consider a problem in which an analysis is needed for categorical data from a single two-way table with partial 
classification (i.e., both item and unit nonresponses). We assume that this is the only information available. A Bayesian 
methodology permits modeling different patterns of missingness under ignorability and nonignorability assumptions. We 
construct a nonignorable nonresponse model which is obtained from the ignorable nonresponse model via a model 
expansion using a data-dependent prior; the nonignorable nonresponse model robustifies the ignorable nonresponse model. 
A multinomial-Dirichlet model, adjusted for the nonresponse, is used to estimate the cell probabilities, and a Bayes factor is 
used to test for association. We illustrate our methodology using data on bone mineral density and family income. A 
sensitivity analysis is used to assess the effects of the data-dependent prior. The ignorable and nonignorable nonresponse 
models are compared using a simulation study, and there are subtle differences between these models. 


Key Words: Bayes factor; Chi-squared statistic; Importance function; Markov chain Monte Carlo; Multinomial- 
Dirichlet model; Robust; Two-way categorical table. 


1. Introduction 


It is a common practice to use two-way categorical tables 
to present survey data. For many surveys there are missing 
data, and this gives rise to partial classification of the 
sampled individuals. Thus, for the two-way table there are 
both item nonresponse (one of the two categories is missing) 
and unit nonresponse (both categories are missing); see 
Little and Rubin (2002, section 1.3) for definitions of the 
three missing data mechanisms (MCAR, MAR, MNAR). 
Thus, there are four tables (one table with the complete 
cases, and three possible supplemental tables: one table with 
row classification only, one table with column classification 
only, and one table with neither row nor column classifi- 
cation). One may not know how the data are missing. Thus, 
we use a model in which the likelihood function accounts 
for differences between the observed data and missing data 
(i.e., nonignorable missing data); see Rubin (1976) and 
Little and Rubin (2002) for the relation between igno- 
rability/nonignorability and these three missing data 
mechanisms. Because there are well-known advantages of 
the Bayesian method over the non-Bayesian method for 
these problems, we propose a Bayesian analysis of a general 
$r \times c$ categorical table, consisting of a table with complete
cases and three supplemental tables. Specifically, we 
develop a Bayesian method to estimate the cell probabilities 
and test for association between the two categorical 
variables. 


We assume that the only information available to the data 
analysts is the complete cases and the three supplemental 
tables. Specifically, we assume that there is no information 
(either from covariates or prior information) about non- 
ignorability. In our Bayesian approach, the survey design 
features have been suppressed (i.e., there are no survey 
weights and there are no clustering or stratification). 
Sometimes survey data are presented to the public with 
certain features of the data suppressed for reasons of 
convenience and confidentiality. We recognize that both the 
ignorable and the nonignorable nonresponse models may be 
incorrect when they do not take account of these features. 
However, the parameters in the ignorable nonresponse 
model are identifiable and estimable, and one can take 
advantage of this fact to construct a nonignorable non- 
response model which is related to the ignorable non- 
response model. Also, in the ignorable nonresponse model 
we assume that there is a MAR mechanism that drives the 
nonresponse, and there may be information in the in- 
complete cases (i.e., the two tables with observed row and 
column margins). Without any information about the degree 
of nonignorability, it is sensible to generalize the ignorable 
nonresponse model. This is how we attempt to accomplish 
our objectives in this paper. 

This paper has five sections. In section 1 we have further
discussion of the problem, and we review related
methodology. In section 2, we describe a $3 \times 3$ table of
bone mineral density (BMD) and family income (FI) from the
third National Health and Nutrition Examination Survey
(NHANES III). This is used mainly for illustration. In
section 3, we describe the methodology to obtain estimates 
of the cell probabilities, and we use the Bayes factor to test 
for association of the two attributes. We accomplish these 
objectives by first constructing an ignorable nonresponse 
model, and we show how to expand an ignorable non- 
response model into a nonignorable nonresponse model. In 
section 4, we analyze the NHANES III data to demonstrate 
our methods. Also, a simulation study gives further com- 
parison of the ignorable and the nonignorable nonresponse 
models, and a sensitivity analysis shows that inference is not 
too sensitive to the choice of an important prior distribution. 
Finally, section 5 has concluding remarks. 


1.1 Discussion of the Problem 


We do not know whether an ignorable nonresponse 
model or a nonignorable nonresponse model is appropriate, 
but it is worthwhile noting that Cohen and Duffy (2002) 
point out that “Health surveys are a good example, where it 
seems plausible that propensity to respond may be related to 
health.” Thus, nonignorable nonresponse models are im- 
portant candidates for the analysis of data from health 
surveys. For a general $r \times c$ categorical table (two
categorical variables, one with $r$ categories and the other
with $c$ categories) with nonresponse, our objectives are to
show how to (a) make inference about the cell probabilities,
and (b) test for no association between the two categorical
variables using the Bayes factor. While (a) comes directly
from the modeling, (b) needs one extra step.

Let $I_i$ be the cell indicator for the $i$th individual in an
$r \times c$ table, for $i = 1, \ldots, n$ individuals. Then, it is
well known that if the $I_i$ are independent and identically
distributed, the Pearson's chi-squared statistic has a
$\chi^2_{(r-1)(c-1)}$ distribution. Otherwise the Pearson's
chi-squared statistic does not have a $\chi^2_{(r-1)(c-1)}$
distribution, and this is true when there are
missing data and the respondents and nonrespondents differ.
When this is the case, adjustments must be made to the 
Pearson’s chi-squared statistic. Within the non-Bayesian 
framework Chen and Fienberg (1974) and Wang (2001) 
have corrections for incomplete two-way tables. Although 
not directly relevant here, it is pertinent to mention that 
similar adjustments have been made for cluster sampling 
and stratified random sampling (Rao and Scott 1981, 1984). 
The works of Chen and Fienberg (1974) and Wang (2001) 
can essentially handle item nonresponse only; unit non- 
response is excluded because the modeling is motivated by 
the ignorable nonresponse models (e.g., see discussion in 
Kalton and Kasprzyk 1986). 
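
For reference, the unadjusted Pearson statistic discussed above can be computed as in the following sketch. It uses only the completely classified counts from Table 1 of this paper, which amounts to ignoring the partially classified individuals; that is exactly the situation in which the reference $\chi^2$ distribution is in doubt. The function names are illustrative, and the tail probability uses the closed form that holds for an even number of degrees of freedom.

```python
import math

# Completely classified cases from Table 1 (rows: BMD 0, 1, 2; columns: FI 0, 1, 2).
complete_cases = [[621, 290, 284], [260, 131, 117], [93, 30, 18]]


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def chi2_sf_even_df(x, df):
    """Upper tail of the chi-squared distribution; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


stat, df = pearson_chi2(complete_cases)
print(round(stat, 2), df, round(chi2_sf_even_df(stat, df), 4))
```

This complete-case calculation is only a baseline; it makes no adjustment for item or unit nonresponse.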

The Bayesian method permits us to use a procedure that 
does not rely on asymptotic theory, incorporate non- 
ignorable missingness into the modeling and obtain an 
alternative to Pearson’s chi-squared statistic for testing for 
no association; see Little (2003) for a discussion of the
well-known advantages of the Bayesian approach in survey
sampling. Our alternative to the Pearson chi-squared statistic
is based on the Bayes factor (Kass and Raftery 1995). This
is a statistic that compares a model with association and one
with no association via the ratio of their marginal
likelihoods under the ignorable and the nonignorable
nonresponse models separately.
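
To illustrate what a ratio of marginal likelihoods looks like in the simplest setting, the sketch below computes a log Bayes factor for association in a completely observed two-way table. It is a deliberately simplified stand-in, not the method of this paper: it assumes no missing data, a uniform Dirichlet prior on the cell probabilities under association, and independent uniform Dirichlet priors on the row and column margins under no association. All function names are hypothetical.

```python
from math import lgamma


def log_dirichlet_multinomial(counts, alpha=1.0):
    """log of the integral of prod(theta_k ** n_k) over a symmetric Dirichlet(alpha) prior."""
    k = len(counts)
    n = sum(counts)
    return (
        lgamma(k * alpha)
        - k * lgamma(alpha)
        + sum(lgamma(c + alpha) for c in counts)
        - lgamma(n + k * alpha)
    )


def log_bayes_factor_association(table, alpha=1.0):
    """log BF of 'association' (saturated) versus 'no association' (independence).

    The multinomial coefficient is common to both models and cancels.
    """
    cells = [c for row in table for c in row]
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    log_m_assoc = log_dirichlet_multinomial(cells, alpha)
    log_m_indep = log_dirichlet_multinomial(rows, alpha) + log_dirichlet_multinomial(cols, alpha)
    return log_m_assoc - log_m_indep


print(log_bayes_factor_association([[50, 0], [0, 50]]))    # strongly associated toy table
print(log_bayes_factor_association([[25, 25], [25, 25]]))  # perfectly proportional toy table
```

A positive value favours association and a negative value favours no association. The paper's Bayes factor differs in that the marginal likelihoods are computed under the nonresponse models, which requires the Monte Carlo machinery described later.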

Little and Rubin (2002, chapter 15) discuss the non- 
ignorable nonresponse problem. For example, Rubin, Stern 
and Vehovar (1995) (also discussed in Little and Rubin 
2002, page 345) provide an interesting analysis of the 
November/December 1990 Slovenian Public Opinion 
survey in which there were data on 2,074 prospective voters 
in their plebiscite with three dichotomous variables; there is 
12% nonresponse. They fit both ignorable and nonignorable 
nonresponse models (loglinear with all interactions) to the 
data, and they were satisfied with the ignorable nonresponse 
model. However, they stated “Of course, this does not mean 
that MAR should be automatically applied in all cases. 
Analyses assuming MAR are not likely to be adequate if a 
survey has large amounts of nonresponse, if covariate 
information is limited, or for cases where the missing-data 
mechanism is clearly nonignorable (e.g., censored data).” 


1.2 Related Methodology 


Our methodology is different from Rubin, Stern and
Vehovar (1995). We start with Nandram and Choi (2002a,
b), in which a parameter $\gamma$ centers (can be viewed as an
index) the nonignorable nonresponse model on the
ignorable nonresponse model. When $\gamma = 1$, the
nonignorable nonresponse model is the ignorable nonresponse
model, and thus, the nonignorable nonresponse model
"degenerates" into the ignorable nonresponse model when
$\gamma = 1$; see also Forster and Smith (1998). This is useful
because the nonignorable nonresponse model contains the
ignorable nonresponse model as a special case, thereby
expressing uncertainty about ignorability. Draper (1995)
called this a continuous model expansion, and he has
recommended the use of a continuous model expansion over
a discrete model expansion (i.e., finite mixtures) whenever it
is possible. We simply call the continuous model expansion
an expansion model. Nandram and Choi (2002a, b) obtain
the centering by taking $\gamma \mid \nu \sim \text{Gamma}(\nu, \nu)$, in which
$E(\gamma \mid \nu) = 1$ and $\text{var}(\gamma \mid \nu) = 1/\nu$.
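
The centering property is easy to check by simulation. The sketch below assumes the Gamma distribution is parametrized by shape and rate (both equal to ν), which is the parametrization implied by a mean of 1 and a variance of 1/ν. Python's `random.gammavariate` takes shape and scale, so the rate ν is passed as scale 1/ν. The function name, the value ν = 20 and the seed are illustrative choices.

```python
import random
import statistics


def sample_centering_parameter(nu, size, seed=0):
    """Draws from Gamma(shape=nu, rate=nu), i.e. mean 1 and variance 1/nu."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta) uses beta as a SCALE parameter, so rate nu -> scale 1/nu.
    return [rng.gammavariate(nu, 1.0 / nu) for _ in range(size)]


nu = 20.0
draws = sample_centering_parameter(nu, size=50_000)
print(statistics.fmean(draws), statistics.pvariance(draws), 1.0 / nu)
```

Larger values of ν concentrate the prior more tightly around 1, that is, around the ignorable nonresponse model.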

Nandram and Choi (2002 a) analyze binary data on 
household crimes in the National Crime Survey, and 
Nandram and Choi (2002 b) analyze binary data on doctor 
visits in the National Health Interview Survey. While 
Nandram and Choi (2002 a) has more comparisons, 
Nandram and Choi (2002 b) has more sensitivity analyses. 
Nandram, Han and Choi (2002) describe two hierarchical 
Bayesian models, an ignorable and a nonignorable nonresponse
model, for the analysis of count data from several
areas, the counts in each area being described by a multi- 
nomial distribution. In all these works the issue of associa- 
tion is not relevant because there is a single categorical 
variable. 

The approach in Nandram and Choi (2002a, b) is
attractive, but it does not apply immediately to the current
application to an $r \times c$ categorical table. Specifically, only one
centering parameter was needed in Nandram and Choi 
(2002 a, b). To extend the method of Nandram and Choi 
(2002 a, b), one needs rc centering parameters. Each of 
these parameters has to have a distribution centered at one to 
allow degeneration to the ignorable nonresponse model. 
There are also inequality constraints that must be included 
in the nonignorable nonresponse model. Thus, while this 
idea is attractive, the methodology needed to apply the work 
of Nandram and Choi (2002 a, b) is much beyond the scope 
of our current paper. 

Nandram, Liu, Choi and Cox (2005) extend the work of 
Nandram, Han and Choi (2002) in two important directions 
to (a) consider several two-way categorical tables instead of 
one-way tables and (b) develop a method to study the 
association between the two categorical variables. Nandram, 
Liu, Choi and Cox (2005) analyze data on the relation 
between bone mineral density (BMD) and age from thirty- 
five counties in the third National Health and Nutrition 
Examination Survey. In each county the data are
categorized into two levels of age and three levels of BMD (i.e.,
there are thirty-five $2 \times 3$ categorical tables). Note that the
age of everyone is observed, but the BMD values for a large 
number of individuals are not observed. Thus, for each 
county there is a single table with complete cases, and one 
table with row totals (i.e., the ages of these individuals are 
known, but their BMD values are missing). Here, our 
objective is to extend the work of Nandram, Liu, Choi and 
Cox (2005) to a general $r \times c$ categorical table. This is an
important advance because now there are three supple- 
mental tables (one table with row classification only, one 
table with column classification only, and one table with 
neither row nor column classification) instead of just one 
with row totals as in Nandram, Liu, Choi and Cox (2005). 


2. Data on Bone Mineral Density 
and Family Income 


We briefly describe the 3 × 3 categorical table of bone mineral
density (BMD) and family income (FI). FI is a discrete variable, and
there are three levels: low, medium and high. While BMD is a
continuous variable, the World Health Organization has classified
BMD into three levels: normal, osteopenia and osteoporosis; see
Looker, Orwoll, Johnston, Lindsay, Wahner, Dunn, Calvo and Harris
(1997, 1998). BMD is used to diagnose osteoporosis, a disease of
elderly females, and in NHANES III it is measured for individuals at
least twenty years old (i.e., we use only the data on white females
with chronic conditions who are older than twenty years).

Among those who participated in the examination stage, about 62% of
the individuals have both FI and BMD observed, 8% have only BMD
observed, 29% have only income observed, and 1% have neither income
nor BMD observed. The dataset used in our study is presented in
Table 1 as a 3 × 3 categorical table of BMD and FI. Our problem is to
estimate the proportion of individuals at each BMD-FI level and to
test for association between BMD and FI. In NHANES III the response
rate increases up to age twenty years, and stabilizes after that
age; race, sex and the sampling weights play a minor role (see
Nandram and Choi 2005). Thus, for this application we assume that
the only data available are the four tables of BMD and FI, and we
develop a methodology for this situation.


Table 1
Classification of Bone Mineral Density (BMD) and Family Income
(FI) for 2,998 White Females, at Least 20 Years Old (20+)

                         FI
BMD            0       1       2   Missing     Sum
0            621     290     284       135   1,330
1            260     131     117        69     577
2             93      30      18        27     168
Missing      456     156     266        45     923
Sum        1,430     607     685       276   2,998

Note: BMD: 0 (> 0.82 g/cm²; normal), 1 (> 0.64, ≤ 0.82 g/cm²;
osteopenia), 2 (≤ 0.64 g/cm²; osteoporosis); FI: 0 (< $20,000),
1 (≥ $20,000, < $45,000), 2 (≥ $45,000); BMD is only measured
for age 20+.


It is difficult to assess an association between BMD and FI when
there are many individuals not completely classified (i.e., missing
data). As discussed in the literature, not necessarily on NHANES
III, there are several potentially important confounding variables
such as age, smoking, dietary calcium intake, estrogen replacement
therapy, physical activity, educational attainment, health status
and alcohol consumption (see Ganry, Baudoin and Fardellone 2000).
Farahmand, Persson, Michaelsson, Baron, Parker and Ljunghall (2000)
stated that for postmenopausal women, aged 50-81 years, from six
counties in Sweden, higher household income is associated with
decreased hip fracture risk. Using complete data from NHANES III,
Lauderdale and Rathouz (2003) studied the regression of bone mineral
content on economic indicators (e.g., education and poverty income
ratio). An adjustment was made for other factors such as age, height
and weight. They conclude that “Bone density does not reflect
economic conditions as strongly or consistently as physical
stature.” Unfortunately, these works do not address the
nonignorability of the missing data; missing data are not discussed.
Also, the response rate to income items is usually low.

We have looked at the data for the complete cases more closely. We
fit a multinomial-Dirichlet model with association and one with no
association. The model with association is
$\mathbf{n} \mid p \sim \text{Multinomial}(n, p)$ and
$p \sim \text{Dirichlet}(1, \ldots, 1)$. Note that by no association
we mean that $p_{jk} = p^{(1)}_j p^{(2)}_k$, $j = 1, \ldots, r$,
$k = 1, \ldots, c$, where $\sum_{j=1}^{r} p^{(1)}_j = 1$ and
$\sum_{k=1}^{c} p^{(2)}_k = 1$. Thus, for the model with no
association, $\mathbf{n} \mid p \sim \text{Multinomial}(n, p)$,
$p^{(1)} \sim \text{Dirichlet}(1, \ldots, 1)$, and independently
$p^{(2)} \sim \text{Dirichlet}(1, \ldots, 1)$, where $p^{(1)}$ and
$p^{(2)}$ have r and c components respectively. It is easy to show
that the marginal likelihood with association (as) is
$p_{as}(\mathbf{n}) = (rc-1)!\, n! / (n+rc-1)!$ and with no
association (nas) is

$$p_{nas}(\mathbf{n}) = p_{as}(\mathbf{n})\,
\frac{(r-1)!\,(c-1)!}{(rc-1)!}\,
\frac{(n+rc-1)!}{(n+r-1)!\,(n+c-1)!}\,
\frac{\prod_{j=1}^{r} n_{j\cdot}!\ \prod_{k=1}^{c} n_{\cdot k}!}
     {\prod_{j=1}^{r} \prod_{k=1}^{c} n_{jk}!}.$$

Consider our data in Table 1 again. Under independence (i.e., no
association) the observed chi-squared statistic is 12.7 on 4 degrees
of freedom with a p-value of 0.013 and the hypothesis of no
association is rejected. On the logarithmic scale, the marginal
likelihoods are $p_{nas}(\mathbf{n}) = -46.2$ and
$p_{as}(\mathbf{n}) = -49.6$, resulting in a log Bayes factor of 3.40
for evidence of no association relative to association. Therefore,
while the chi-squared test provides strong evidence against no
association, the log Bayes factor provides strong evidence for no
association. Thus, the two criteria give contradictory evidence
about no association. See Mirkin (2001) for a review of
interpretations of the chi-squared statistic as a measure of
association or independence.
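
The two summaries quoted above can be recomputed from the complete
cases of Table 1. The following is a minimal sketch, not the authors'
code: it assumes uniform Dirichlet(1, ..., 1) priors as in the text,
uses only the Python standard library, and all function names are
illustrative. The closed-form chi-squared tail used here holds for
even degrees of freedom only.

```python
from math import exp, lgamma

# Complete cases of Table 1 (rows: BMD 0,1,2; columns: FI 0,1,2).
COMPLETE = [[621, 290, 284],
            [260, 131, 117],
            [93, 30, 18]]


def log_factorial(m):
    """log(m!) computed through the log-gamma function."""
    return lgamma(m + 1)


def pearson_chi2(table):
    """Pearson chi-squared statistic for independence in a two-way table."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for j, row in enumerate(table):
        for k, obs in enumerate(row):
            expected = rows[j] * cols[k] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def chi2_tail_even_df(x, df):
    """P(X > x) for a chi-squared variable with an even number of df."""
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total


def log_marginals(table):
    """Log marginal likelihoods (association, no association)."""
    r, c = len(table), len(table[0])
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    log_as = log_factorial(r * c - 1) + log_factorial(n) - log_factorial(n + r * c - 1)
    log_nas = (log_as
               + log_factorial(r - 1) + log_factorial(c - 1) - log_factorial(r * c - 1)
               + log_factorial(n + r * c - 1)
               - log_factorial(n + r - 1) - log_factorial(n + c - 1)
               + sum(log_factorial(m) for m in rows)
               + sum(log_factorial(m) for m in cols)
               - sum(log_factorial(m) for row in table for m in row))
    return log_as, log_nas


chi2 = pearson_chi2(COMPLETE)
p_value = chi2_tail_even_df(chi2, df=4)
log_as, log_nas = log_marginals(COMPLETE)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.3f}")
print(f"log p_as = {log_as:.2f}, log p_nas = {log_nas:.2f}, "
      f"log Bayes factor = {log_nas - log_as:.2f}")
```

The script makes the tension concrete: the same nine counts yield a
small p-value and a positive log Bayes factor in favour of no
association.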

How sensitive is the Bayes factor to the choice of the prior
distributions? First, note that the prior density that any
reasonable person might use in this problem is the Dirichlet
distribution. For the model with association we have selected the
prior distribution to be $p \sim \text{Dirichlet}(\gamma)$, and for
the model with no association $p^{(1)} \sim \text{Dirichlet}(\alpha)$
and independently $p^{(2)} \sim \text{Dirichlet}(\beta)$. Let
$n^{(1)}_j = \sum_{k=1}^{c} n_{jk}$, $j = 1, \ldots, r$, and
$n^{(2)}_k = \sum_{j=1}^{r} n_{jk}$, $k = 1, \ldots, c$. It is easy
to show that the Bayes factor for a test of association versus no
association is

$$BF = \frac{D_{rc}(\mathbf{n} + \gamma) \,/\,
             \{D_r(n^{(1)} + \alpha)\, D_c(n^{(2)} + \beta)\}}
            {D_{rc}(\gamma) \,/\, \{D_r(\alpha)\, D_c(\beta)\}},$$

where $D_r(\cdot)$ refers to the Dirichlet function with r
components, etc.; see section 3.1 for notations. Then, we choose
each of the components of $\alpha$, $\beta$ and $\gamma$ to be
$\kappa$ (e.g., in $p_{as}(\mathbf{n})$ and $p_{nas}(\mathbf{n})$,
$\kappa = 1$). Sensitivity to the choice of prior distributions can
be studied in terms of $\kappa$. Here $\kappa = 1$ corresponds to the
prior distributions that are usually used in the
multinomial-Dirichlet model, and $\kappa = 0.50$ to Jeffreys’ prior.
Thus, we have chosen $\kappa = 0.25, 0.5, 1.0, 1.5, 2, 3$, and the
corresponding Bayes factors (log scale) are 4.7, 3.6, 3.4, 3.9, 4.7,
6.6. Thus, while the Bayes factor is sensitive to the choice of the
prior distributions, it is not too sensitive. Of course, if there is
informative prior information, in which $\kappa$ is substantially
large, it is a different issue.
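
The dependence on the common prior parameter can be explored
numerically. The sketch below is illustrative rather than the
authors' program: it assumes every component of the three Dirichlet
priors equals `kappa`, evaluates the Dirichlet function through
`math.lgamma`, and reports the log Bayes factor in the direction used
for the numbers quoted above (no association relative to
association). The function names are hypothetical.

```python
from math import lgamma

# Complete cases of Table 1 (rows: BMD, columns: FI).
COMPLETE = [[621, 290, 284],
            [260, 131, 117],
            [93, 30, 18]]


def log_dirichlet_function(a):
    """log D(a) = sum_j log Gamma(a_j) - log Gamma(sum_j a_j)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))


def log_bf_no_association(table, kappa):
    """Log Bayes factor of no association relative to association.

    The multinomial coefficient is common to both marginal likelihoods
    and cancels, so only Dirichlet functions are needed.
    """
    r, c = len(table), len(table[0])
    cells = [m for row in table for m in row]
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    log_m_as = (log_dirichlet_function([m + kappa for m in cells])
                - log_dirichlet_function([kappa] * (r * c)))
    log_m_nas = (log_dirichlet_function([m + kappa for m in rows])
                 - log_dirichlet_function([kappa] * r)
                 + log_dirichlet_function([m + kappa for m in cols])
                 - log_dirichlet_function([kappa] * c))
    return log_m_nas - log_m_as


log_bf_by_kappa = {k: log_bf_no_association(COMPLETE, k)
                   for k in (0.25, 0.5, 1.0, 1.5, 2.0, 3.0)}
for kappa, value in log_bf_by_kappa.items():
    print(f"kappa = {kappa:4.2f}: log Bayes factor = {value:.2f}")
```

At `kappa = 1` this reduces to the uniform-prior calculation and
reproduces the log Bayes factor of about 3.4 reported in the text.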

The Pearson chi-squared statistic is dominated by cells (3, 1) and
(3, 3), with squares of the Pearson residuals being 4.61 and 6.15
respectively (the observed chi-squared statistic is 12.7). It is
interesting that the Bayes factor tends to smooth this effect out.
We have collapsed the two categories, osteopenia and osteoporosis,
into a single category. For this 2 × 3 categorical table, the
chi-squared test statistic is 1.7 on 2 degrees of freedom with a
p-value of 0.42. The marginal likelihoods are
$p_{nas}(\mathbf{n}) = -28.2$ and $p_{as}(\mathbf{n}) = -32.0$,
resulting in a log Bayes factor of -3.81. Therefore, both tests
suggest no association for this 2 × 3 table. Thus, based on these
data it is hard to believe that there is an association between BMD
and FI. The question that now arises is “Can this conclusion change
if we take into account the incomplete data?”


3. Methodology and Nonresponse Models 


First, we describe the notation. Second, we describe the 
ignorable nonresponse model. Third, we construct a non- 
ignorable nonresponse model by expanding the ignorable 
nonresponse model. Fourth, we discuss the Bayes factor. 
Finally, we describe how to specify an important prior 
distribution. 


3.1 Notation 


For an r × c categorical table, let $I_{ijk} = 1$ if the $i$-th
individual falls in the $j$-th row and $k$-th column and 0
otherwise. Also, let $J_{is} = 1$ if the $i$-th individual falls in
table s (s = 1: complete cases; s = 2: table with row totals; s = 3:
table with column totals; s = 4: table with individuals
unclassified), and $J_{is} = 0$ otherwise, $s = 1, 2, 3, 4$, with
$\sum_{s=1}^{4} J_{is} = 1$; that is,
$J_i = (J_{i1}, J_{i2}, J_{i3}, J_{i4})'$ has four components
corresponding to the four tables.

Let $p_{jk}$ be the probability that an individual belongs to cell
(j, k) of the r × c table, and let $\pi_{sjk}$ be the probability
that an individual belongs to the $s$-th table, given the cell
status (j, k). For the ignorable nonresponse model
$\pi_{sjk} = \pi_s$, but for a nonignorable nonresponse model
$\pi_{sjk}$ depends on at least one of j and k as well. We will also
let $p$ be the vector $\{p_{jk},\ j = 1, \ldots, r,\ k = 1, \ldots,
c\}$, and $\pi_{jk}$ be a vector with components
$\{\pi_{sjk},\ s = 1, 2, 3, 4\}$, $j = 1, \ldots, r$,
$k = 1, \ldots, c$.

Then, we take

$$I_i \mid p \overset{iid}{\sim} \text{Multinomial}\{1, p\}, \qquad (1)$$

where $I_i = (I_{i11}, \ldots, I_{irc})'$, $i = 1, \ldots, n$. For
the parameters $p$ we take

$$p \sim \text{Dirichlet}(1, \ldots, 1), \quad p_{jk} \ge 0, \quad
\sum_{j=1}^{r} \sum_{k=1}^{c} p_{jk} = 1. \qquad (2)$$

Henceforth, we will use the notation that a k-dimensional vector
$x \sim \text{Dirichlet}(ct)$ to mean
$f(x) = \{\prod_{j=1}^{k} x_j^{c_j t - 1}\} / D_k(ct)$, $x_j \ge 0$,
$\sum_{j=1}^{k} x_j = 1$, where
$D_k(ct) = \{\prod_{j=1}^{k} \Gamma(c_j t)\} / \Gamma(t)$ is the
Dirichlet function with $c_j > 0$, $\sum_{j=1}^{k} c_j = 1$.

Assumptions (1) and (2) are the same for both the 
ignorable and nonignorable nonresponse models, and they 
are standard when there are no missing data. 

Let the cell counts be $y_{sjk} = \sum_{i=1}^{n} I_{ijk} J_{is}$,
$s = 1, 2, 3, 4$, for the four cases. Here $y_{1jk}$ are observed and
$y_{sjk}$, $s = 2, 3, 4$, are missing (i.e., latent variables). For
$y_{1jk}$ we know that $\sum_{j=1}^{r} \sum_{k=1}^{c} y_{1jk} = n_0$,
the number of individuals with complete data; for $y_{2jk}$ we know
that $\sum_{k=1}^{c} y_{2jk} = u_j$, where the row margins $u_j$,
$j = 1, \ldots, r$, are observed; for $y_{3jk}$ we know that
$\sum_{j=1}^{r} y_{3jk} = v_k$, where the column margins $v_k$,
$k = 1, \ldots, c$, are observed; and for $y_{4jk}$ we know that
$\sum_{j=1}^{r} \sum_{k=1}^{c} y_{4jk} = w$. Throughout we assume
that all inference is conditional on $n_0$, $u$, $v$ and $w$, and we
will suppress this notation whenever it is understood. Whenever it
is convenient, we will use notations such as
$\sum_{s,j,k} y_{sjk} \equiv \sum_{s=1}^{4} \sum_{j=1}^{r}
\sum_{k=1}^{c} y_{sjk}$, $\prod_{s,j,k}$ (defined analogously), and
$y_{(1)} = (y_2, y_3, y_4)$, $y_{(2)} = (y_1, y_3, y_4)$, etc., where
$y_s = (y_{sjk},\ j = 1, \ldots, r,\ k = 1, \ldots, c)$,
$s = 1, 2, 3, 4$. Also, $\sum_{s,j,k} y_{sjk} = n$. We will also use
$y_{s\cdot\cdot} = \sum_{j,k} y_{sjk}$,
$y_{\cdot jk} = \sum_{s} y_{sjk}$ and $y = (y_1, y_2, y_3, y_4)$.

3.2 Ignorable Nonresponse Model 
For the ignorable nonresponse model we take

$$J_i \mid \pi \overset{iid}{\sim} \text{Multinomial}\{1, \pi\}. \qquad (3)$$

That is, there is no dependence on the cell status of an individual.

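Then, the augmented likelihood function for
$p, \pi, y_{(1)} \mid y_1, n_0, u, v, w$ is

$$g(p, \pi, y_{(1)} \mid y_1, n_0, u, v, w) \propto
\prod_{s=1}^{4} \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{(\pi_s\, p_{jk})^{y_{sjk}}}{y_{sjk}!} \qquad (4)$$

subject to $\sum_{j=1}^{r} \sum_{k=1}^{c} y_{1jk} = n_0$,
$\sum_{k=1}^{c} y_{2jk} = u_j$, $j = 1, \ldots, r$,
$\sum_{j=1}^{r} y_{3jk} = v_k$, $k = 1, \ldots, c$, and
$\sum_{j=1}^{r} \sum_{k=1}^{c} y_{4jk} = w$. There are three
interesting features in (4). First, under ignorability the
likelihood function separates into two pieces, one that contains the
$\pi_s$ only and the other the $p_{jk}$, and inferences about these
two sets of parameters are unrelated. Second, inference about
$\pi_s$ is based only on the observed $y_{s\cdot\cdot}$ (i.e., the
sufficient statistics for $\pi_1, \pi_2, \pi_3$ and $\pi_4$ are
essentially the proportions of cases in the first, second, third and
fourth tables respectively). Third, under the ignorable nonresponse
model, the $u_j$ and the $v_k$ contain information about the
$p_{jk}$; $w$ does not contain any information about the $p_{jk}$.
This is easy to show; letting $T$ denote the set
$\{(y_2, y_3, y_4): \sum_{k=1}^{c} y_{2jk} = u_j,\ j = 1, \ldots, r,\
\sum_{j=1}^{r} y_{3jk} = v_k,\ k = 1, \ldots, c,\
\sum_{j=1}^{r} \sum_{k=1}^{c} y_{4jk} = w\}$,

$$\sum_{(y_2, y_3, y_4) \in T}\ \prod_{s=1}^{4} \prod_{j=1}^{r}
\prod_{k=1}^{c} \frac{p_{jk}^{y_{sjk}}}{y_{sjk}!}
= \left\{ \prod_{j=1}^{r} \prod_{k=1}^{c}
  \frac{p_{jk}^{y_{1jk}}}{y_{1jk}!} \right\}
  \left\{ \prod_{j=1}^{r}
  \frac{(\sum_{k=1}^{c} p_{jk})^{u_j}}{u_j!} \right\}
  \left\{ \prod_{k=1}^{c}
  \frac{(\sum_{j=1}^{r} p_{jk})^{v_k}}{v_k!} \right\}
  \frac{1}{w!}.$$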

Finally, for the parameters $\pi$ we take

$$\pi \sim \text{Dirichlet}(1, 1, 1, 1), \quad \pi_s \ge 0,\
s = 1, 2, 3, 4, \quad \sum_{s=1}^{4} \pi_s = 1. \qquad (5)$$

Note that this is a uniform probability density in four-dimensional
space, and there are no hyperparameters in this model. Thus, for the
ignorable nonresponse model, combining (2) and (5), the joint prior
density is

$$g_1(p, \pi) \propto 1, \quad p_{jk} \ge 0, \quad
\sum_{j=1}^{r} \sum_{k=1}^{c} p_{jk} = 1, \quad \pi_s \ge 0, \quad
\sum_{s=1}^{4} \pi_s = 1, \qquad (6)$$

which is proper.
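Finally, combining the likelihood function in (4) with the joint
prior density in (6) via Bayes’ theorem, the joint posterior density
of the parameters $\pi$, $p$ and $y_{(1)}$ is

$$\pi(p, \pi, y_{(1)} \mid y_1) \propto
\left\{ \prod_{s=1}^{4} \pi_s^{y_{s\cdot\cdot}} \right\}
\left\{ \prod_{s=1}^{4} \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{p_{jk}^{y_{sjk}}}{y_{sjk}!} \right\}. \qquad (7)$$

A posteriori $p$ and $\pi$ are independent. Inference about $\pi$ is
easy because
$\pi \mid y_1, y_{(1)} \sim \text{Dirichlet}(y_{1\cdot\cdot} + 1,
\ldots, y_{4\cdot\cdot} + 1)$, which is independent of $y_{(1)}$.
Inference about $p$ can be obtained using a simple Gibbs sampler
because, letting $q^{(1)}_{jk} = p_{jk} / \sum_{k'=1}^{c} p_{jk'}$
and $q^{(2)}_{jk} = p_{jk} / \sum_{j'=1}^{r} p_{j'k}$, the
conditional distributions are

$$p \mid y \sim \text{Dirichlet}(y_{\cdot 11} + 1, \ldots,
y_{\cdot rc} + 1),$$
$$y_{2j} \mid p, u_j, y_{(2)} \overset{ind}{\sim}
\text{Multinomial}(u_j, q^{(1)}_j), \quad j = 1, \ldots, r,$$
$$y_{3k} \mid p, v_k, y_{(3)} \overset{ind}{\sim}
\text{Multinomial}(v_k, q^{(2)}_k), \quad k = 1, \ldots, c,$$
$$y_4 \mid p, w, y_{(4)} \sim \text{Multinomial}(w, p). \qquad (8)$$

Here $y_{2j} = (y_{2j1}, \ldots, y_{2jc})$,
$q^{(1)}_j = (q^{(1)}_{j1}, \ldots, q^{(1)}_{jc})$,
$y_{3k} = (y_{31k}, \ldots, y_{3rk})$ and
$q^{(2)}_k = (q^{(2)}_{1k}, \ldots, q^{(2)}_{rk})$.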


Clearly, the parameters $p$ and $\pi$ are identifiable and
estimable. Also, note that $y_4$ in (8) is a latent variable and
that it does not contribute to inference about $p$. Rather it
assists in the computation by providing a simple Gibbs sampler.
However, we note that information in $y_4$, via $w$, is important
under a nonignorable nonresponse model.
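
The Gibbs sampler in (8) is short enough to sketch in full. The code
below is an illustrative implementation under stated assumptions, not
the authors' program: it reads Table 1 with BMD as rows and FI as
columns (so the "Missing" column supplies the row margins $u_j$ and
the "Missing" row supplies the column margins $v_k$), uses only the
standard library, and picks an arbitrary chain length, burn-in and
seed. All helper names are hypothetical.

```python
import random

# Table 1: rows are BMD levels, columns are FI levels.
Y1 = [[621, 290, 284], [260, 131, 117], [93, 30, 18]]  # complete cases
U = [135, 69, 27]      # BMD observed, FI missing (row margins u_j)
V = [456, 156, 266]    # FI observed, BMD missing (column margins v_k)
W = 45                 # neither variable observed


def rdirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def rmultinomial(n, weights, rng):
    """Draw multinomial counts; weights need not be normalised."""
    counts = [0] * len(weights)
    for idx in rng.choices(range(len(weights)), weights=weights, k=n):
        counts[idx] += 1
    return counts


def gibbs_ignorable(n_iter=600, burn_in=100, seed=1):
    """Posterior means of p_jk under the ignorable model, following (8)."""
    rng = random.Random(seed)
    r, c = len(Y1), len(Y1[0])
    p = [[1.0 / (r * c)] * c for _ in range(r)]
    running = [[0.0] * c for _ in range(r)]
    for it in range(n_iter):
        # Row-only table: spread each u_j over the columns of row j.
        y2 = [rmultinomial(U[j], p[j], rng) for j in range(r)]
        # Column-only table: spread each v_k over the rows of column k.
        y3 = [[0] * c for _ in range(r)]
        for k in range(c):
            column = rmultinomial(V[k], [p[j][k] for j in range(r)], rng)
            for j in range(r):
                y3[j][k] = column[j]
        # Unclassified table: spread w over all cells.
        y4 = rmultinomial(W, [p[j][k] for j in range(r) for k in range(c)], rng)
        # Update p given the completed counts.
        alpha = [Y1[j][k] + y2[j][k] + y3[j][k] + y4[j * c + k] + 1
                 for j in range(r) for k in range(c)]
        draw = rdirichlet(alpha, rng)
        p = [draw[j * c:(j + 1) * c] for j in range(r)]
        if it >= burn_in:
            for j in range(r):
                for k in range(c):
                    running[j][k] += p[j][k]
    kept = n_iter - burn_in
    return [[value / kept for value in row] for row in running]


posterior_mean = gibbs_ignorable()
for row in posterior_mean:
    print(" ".join(f"{value:.3f}" for value in row))
```

The imputed counts `y2`, `y3` and `y4` produced inside the loop are
the kind of completed tables that section 3.5 later reuses to specify
the prior on $\tau$.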


3.3 Nonignorable Nonresponse Model 


For nonignorable missing data we take

$$J_i \mid \{I_{ijk} = 1,\ I_{ij'k'} = 0,\ j' \ne j,\ k' \ne k,\
\pi_{jk}\} \overset{ind}{\sim} \text{Multinomial}\{1, \pi_{jk}\}.
\qquad (9)$$

Assumption (9) specifies that the probabilities that an individual
belongs to one of the four tables depend on the two characteristics
(i.e., row and column classifications) of the individual. In this
manner we incorporate the assumption that the missing data are
nonignorable. This is an extension of the model in Nandram, Han and
Choi (2002). One can also have $\pi_j$ or $\pi_k$ instead of
$\pi_{jk}$; the methodology is similar.

Next, we need the likelihood function. Here the augmented likelihood
function for $p, \pi, y_{(1)} \mid y_1$ is

$$g(p, \pi, y_{(1)} \mid y_1, n_0, u, v, w) \propto
\prod_{s=1}^{4} \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{(\pi_{sjk}\, p_{jk})^{y_{sjk}}}{y_{sjk}!} \qquad (10)$$

subject to $\sum_{j=1}^{r} \sum_{k=1}^{c} y_{1jk} = n_0$,
$\sum_{k=1}^{c} y_{2jk} = u_j$, $j = 1, \ldots, r$,
$\sum_{j=1}^{r} y_{3jk} = v_k$, $k = 1, \ldots, c$, and
$\sum_{j=1}^{r} \sum_{k=1}^{c} y_{4jk} = w$.
Observe that in (10) the parameters $p_{jk}$ and $\pi_{sjk}$ are not
identifiable. Clearly, to estimate $p_{jk}$ one needs to know
$y_{\cdot jk}$, but only the $y_{1jk}$ are known. Also, to estimate
$\pi_{sjk}$ one needs to know $y_{sjk}$, $s = 2, 3, 4$; thus, the
$y_{sjk}$, $s = 2, 3, 4$, are also not identifiable. Putting very
informative proper priors on the $\pi_{sjk}$ will help, but this is
not a practical solution. If an ignorable model (i.e.,
$\pi_{sjk} = \pi_s$) is used, then all the parameters can be
identified. Therefore, a sensible solution is to attempt to link the
$\pi_{jk}$ over (j, k) using a common feature. If the $\pi_{jk}$ come
from a common distribution with “known” parameters, we would be able
to estimate them. That is, we must attempt to “borrow strength” as
in small area estimation. This permits estimation of $y_{(1)}$
which, in turn, will facilitate estimation of the $p_{jk}$ and
$\pi_{sjk}$. For the $\pi_{jk}$ we “center” the nonignorable
nonresponse model on the ignorable nonresponse model. Specifically,
we assume that

$$\pi_{jk} \mid \mu, \tau \overset{iid}{\sim}
\text{Dirichlet}(\mu_1 \tau, \mu_2 \tau, \mu_3 \tau, \mu_4 \tau),
\quad \pi_{sjk} \ge 0, \quad \sum_{s=1}^{4} \pi_{sjk} = 1, \qquad (11)$$

$j = 1, \ldots, r$, $k = 1, \ldots, c$. In (11) the parameter $\tau$
tells us about the closeness of the nonignorable nonresponse model
to the ignorable nonresponse model. For example, if $\tau$ is small,
the $\pi_{jk}$ will be very different, and if $\tau$ is large, the
$\pi_{jk}$ will be very similar. Thus, inference may be sensitive to
the choice of $\tau$, and one has to be careful in choosing $\tau$.
In the absence of any information about nonignorability, it is
natural to choose a prior density for $\tau$ so that the
nonignorable nonresponse model generalizes the ignorable nonresponse
model. This generalization is attained because as $\tau$ goes to
infinity, the $\pi_{jk}$ converge to the same value over (j, k) (not
component-wise), the ignorable nonresponse model. The parameters
$\mu$ and $\tau$ are not identifiable because the $\pi_{jk}$ are
not. Thus, it is impossible to estimate $\mu$ and $\tau$ without any
information; a natural way to proceed is to attempt to use some of
the data already observed.
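
The role of $\tau$ in (11) can be seen by simulation. The sketch below
is illustrative only: the centre `MU` is a hypothetical value chosen
to resemble the table proportions, the values of `tau` are arbitrary,
and the spread is summarised by the standard deviation of the first
component, whose theoretical value is
$\sqrt{\mu_1 (1 - \mu_1) / (\tau + 1)}$.

```python
import random
import statistics
from math import sqrt

# Hypothetical centre of the Dirichlet prior in (11); it sums to one.
MU = [0.615, 0.077, 0.292, 0.016]


def rdirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def spread_of_first_component(tau, n_draws=2000, seed=0):
    """Monte Carlo standard deviation of pi_1 when pi ~ Dirichlet(MU * tau)."""
    rng = random.Random(seed)
    alpha = [m * tau for m in MU]
    draws = [rdirichlet(alpha, rng)[0] for _ in range(n_draws)]
    return statistics.pstdev(draws)


def theoretical_spread(tau):
    """Exact standard deviation of the first Dirichlet component."""
    return sqrt(MU[0] * (1 - MU[0]) / (tau + 1))


spreads = {tau: spread_of_first_component(tau) for tau in (5, 50, 357, 5000)}
for tau, value in spreads.items():
    print(f"tau = {tau:5d}: simulated sd = {value:.4f}, "
          f"theoretical sd = {theoretical_spread(tau):.4f}")
```

As `tau` grows the draws concentrate on `MU`, so every cell shares
essentially the same response probabilities and the model approaches
the ignorable one.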

Specifically, a priori we take $\mu$ and $\tau$ to be independent
with

$$p(\mu) = 1, \quad \mu_s \ge 0,\ s = 1, 2, 3, 4, \quad
\sum_{s=1}^{4} \mu_s = 1, \qquad
\tau \sim \text{Gamma}(\alpha_0, \beta_0), \quad \tau > 0, \qquad (12)$$

where $\alpha_0$ and $\beta_0$ are to be specified; without any
information about $\alpha_0$ and $\beta_0$ one needs to use the data
again. To help specify $\alpha_0$ and $\beta_0$ for the nonignorable
nonresponse model, we have used the ignorable nonresponse model. The
prior on $\tau$ adds extra variation, thereby permitting some degree
of nonignorability (see section 3.5). Note again that if $\tau$ is
very large (i.e., $\alpha_0 \gg \beta_0$), this nonignorable
nonresponse model degenerates into the ignorable nonresponse model.
Thus, an issue of how sensitive inference is to this specification
arises. Of course, one can choose other distributions for $\tau$ in
(12) (e.g., lognormal distribution), but this is really not the key
issue.

Combining (2), (11) and (12), the joint prior density of $\pi$, $p$,
$\mu$ and $\tau$ is

$$\pi(p, \pi, \mu, \tau) \propto
\left\{ \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{\prod_{s=1}^{4} \pi_{sjk}^{\mu_s \tau - 1}}{D_4(\mu\tau)}
\right\} \tau^{\alpha_0 - 1} e^{-\beta_0 \tau}. \qquad (13)$$

Note again that (13) is a proper prior density. Finally, combining
the likelihood function in (10) with the joint prior density in (13)
via Bayes’ theorem, the joint posterior density of the parameters
$\pi$, $p$, $\mu$, $\tau$ and the latent variables $y_{(1)}$ is

$$\pi(p, \pi, \mu, \tau, y_{(1)} \mid y_1) \propto
\left\{ \prod_{s=1}^{4} \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{(\pi_{sjk}\, p_{jk})^{y_{sjk}}}{y_{sjk}!} \right\}
\left\{ \prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{\prod_{s=1}^{4} \pi_{sjk}^{\mu_s \tau - 1}}{D_4(\mu\tau)}
\right\} \tau^{\alpha_0 - 1} e^{-\beta_0 \tau}. \qquad (14)$$

In Appendix A we show how to fit the nonignorable nonresponse model
to obtain the appropriate inference using the Gibbs sampler.


3.4 Bayes Factor: Tests of Association and 
Nonignorability 


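We construct a test for the association between BMD and FI. This
test is an assessment of the assumption that
$p_{jk} = q_{1j}\, q_{2k}$, $j = 1, \ldots, r$, $k = 1, \ldots, c$,
with $\sum_{j=1}^{r} q_{1j} = 1$ and $\sum_{k=1}^{c} q_{2k} = 1$. We
use the Bayes factor, the ratio of the marginal likelihoods under
two scenarios (e.g., association versus no association). Note that
we observe $y_1$, but $y_{(1)}$ is a set of latent variables. So
each marginal likelihood is simply the probability that $y_1$ is the
observed value of $Y_1$, which we denote by $p(y_1)$.

We set

$$C = \Big\{ y_{(1)}: \sum_{k=1}^{c} y_{2jk} = u_j,\ j = 1, \ldots,
r;\ \ \sum_{j=1}^{r} y_{3jk} = v_k,\ k = 1, \ldots, c;\ \
\sum_{j=1}^{r} \sum_{k=1}^{c} y_{4jk} = w \Big\}.$$

Then, letting $d = 3!\, n!\, (rc-1)!$ and
$e = 3!\, n!\, (r-1)!\, (c-1)!$, the marginal likelihood for the
ignorable (IG) nonresponse model is

$$p_{IG}(y_1) = \begin{cases}
d \sum_{y_{(1)} \in C} \int \prod_{s,j,k}
\{ (\pi_s\, p_{jk})^{y_{sjk}} / y_{sjk}! \}\, d\pi\, dp,
& \text{association} \\[4pt]
e \sum_{y_{(1)} \in C} \int \prod_{s,j,k}
\{ (\pi_s\, q_{1j}\, q_{2k})^{y_{sjk}} / y_{sjk}! \}\,
d\pi\, dq_1\, dq_2,
& \text{no association,}
\end{cases} \qquad (15)$$

and letting $\Omega_{as} = (p, \pi, \mu, \tau)$ and
$\Omega_{nas} = (q_1, q_2, \pi, \mu, \tau)$, the marginal likelihood
for the nonignorable (NIG) nonresponse model is

$$p_{NIG}(y_1) = \begin{cases}
d \sum_{y_{(1)} \in C} \int \prod_{s,j,k}
\{ (\pi_{sjk}\, p_{jk})^{y_{sjk}} / y_{sjk}! \}\,
g(\pi, \mu, \tau)\, d\Omega_{as},
& \text{association} \\[4pt]
e \sum_{y_{(1)} \in C} \int \prod_{s,j,k}
\{ (\pi_{sjk}\, q_{1j}\, q_{2k})^{y_{sjk}} / y_{sjk}! \}\,
g(\pi, \mu, \tau)\, d\Omega_{nas},
& \text{no association,}
\end{cases} \qquad (16)$$

where

$$g(\pi, \mu, \tau) =
\frac{\beta_0^{\alpha_0}\, \tau^{\alpha_0 - 1} e^{-\beta_0 \tau}}
     {\Gamma(\alpha_0)}
\prod_{j=1}^{r} \prod_{k=1}^{c}
\frac{\prod_{s=1}^{4} \pi_{sjk}^{\mu_s \tau - 1}}{D_4(\mu\tau)}.
\qquad (17)$$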

The summation over the set C is computationally intensive because
there are numerous points $y_{(1)} \in C$ (i.e., we need to sum over
all of them). We avoid this problem by first summing over C
analytically; the rest is obtained using Monte Carlo integration.

For the ignorable model it is easy to show that 


Pig(¥,) = 
monk (Fol)! 
n+1(n+rc-1)!’ 
association 
3a! (r-pue-p! TT TT, a! 
nt V(n+r—I@re=I! TT] TL, ie! 


(18) 


no association, 


where n is the total number of individuals in the entire table.
We describe how to estimate $p_{NIG}(y_1)$ in Appendix B.

However, we note that a test for ignorability or non- 
ignorability is tenuous because we assume that there is no 
information about ignorability or nonignorability. Yet, our 
nonignorable nonresponse model is a generalization of our 
ignorable nonresponse model. We believe that the test about 
association under the ignorable nonresponse model or 
nonignorable nonresponse model is reliable. 

Finally, we note that the Bayes factor may be sensitive to prior
specifications, especially when there are not enough data to
estimate the parameters under test; see Sinharay and Stern (2002)
for an interesting discussion on nested models. We have studied
sensitivity of the Bayes factor with respect to the specification of
$\alpha_0$ and $\beta_0$ in (17); see section 3.5 and Table 6. This
is useful because it is an important prior in our nonignorable
nonresponse model. However, the main comparison is a test for no
association under the ignorable nonresponse model and the
nonignorable nonresponse model separately. The parameter $\tau$ only
enters the nonignorable nonresponse model, and $\tau$ has the same
prior under association and no association.


3.5 Specification of $\alpha_0$ and $\beta_0$


The specification of the hyperparameters $\alpha_0$ and $\beta_0$ in
$\tau \sim \text{Gamma}(\alpha_0, \beta_0)$ is a key issue in our
method; see (12). This is important because we use this technique to
robustify the ignorable nonresponse model; a sensitivity analysis is
performed later. Note that $E(\tau) = \alpha_0 / \beta_0$; thus if
$\alpha_0 \gg \beta_0$, the nonignorable nonresponse model will be
similar to the ignorable nonresponse model. Suppose we can observe a
random sample of values of $\tau$ from
$\text{Gamma}(\alpha_0, \beta_0)$. Then, we can use a simple method
(e.g., the method of moments) to estimate $\alpha_0$ and $\beta_0$.

How can we obtain a sample to fit $\text{Gamma}(\alpha_0, \beta_0)$?
The Gibbs sampler in (8) for the ignorable nonresponse model gives
imputed values for the missing cell counts. We have imputed the
missing cell counts M times, M = 1,000; let
$n^{(h)}_{1jk} = y_{1jk}$ and $n^{(h)}_{sjk}$, $s = 2, 3, 4$,
$h = 1, \ldots, M$, denote the missing cell counts. Then, for each h
we fit the nonignorable nonresponse model without the prior
specification in (12),

$$(n^{(h)}_{111}, \ldots, n^{(h)}_{4rc}) \mid p, \pi \sim
\text{Multinomial}\{n, (\pi_{111}\, p_{11}, \ldots,
\pi_{4rc}\, p_{rc})\},$$

$p \sim \text{Dirichlet}(1)$, and
$\pi_{jk} \overset{iid}{\sim} \text{Dirichlet}(\alpha)$ where
$\alpha_s = \mu_s \tau$, $s = 1, 2, 3, 4$.

After integrating out $p$ and the $\pi_{jk}$, we get the likelihood
function,

$$f(\alpha \mid n^{(h)}) \propto \prod_{j=1}^{r} \prod_{k=1}^{c}
\left\{ \frac{\Gamma(\sum_{s=1}^{4} \alpha_s)}
             {\Gamma(\sum_{s=1}^{4} (\alpha_s + n^{(h)}_{sjk}))}
\prod_{s=1}^{4}
\frac{\Gamma(\alpha_s + n^{(h)}_{sjk})}{\Gamma(\alpha_s)} \right\},
\quad \alpha_s > 0,\ s = 1, 2, 3, 4. \qquad (19)$$


Using the Nelder-Mead algorithm to maximize the likelihood function
in (19) over $\alpha_s > 0$, $s = 1, 2, 3, 4$, at the $h$-th
iterate, we obtain the maximum likelihood estimators
$\hat{\alpha}^{(h)}_s$, $h = 1, \ldots, M$. Now letting
$\tau^{(h)} = \sum_{s=1}^{4} \hat{\alpha}^{(h)}_s$, we view
$\tau^{(h)}$, $h = 1, \ldots, M$, as a random sample from
$\text{Gamma}(\alpha_0, \beta_0)$.
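
The objective maximized at each iterate is a product of
Dirichlet-multinomial terms, one per cell. The sketch below evaluates
its logarithm, assuming the likelihood has the form
$\prod_{j,k} D_4(\alpha + n_{jk}) / D_4(\alpha)$ that results from
integrating each $\pi_{jk}$ against its Dirichlet prior; the function
names and the toy counts are hypothetical, and no optimizer is
included (any Nelder-Mead routine could be applied to the negative of
this function).

```python
from math import lgamma, log


def log_dirichlet_function(a):
    """log D(a) = sum_s log Gamma(a_s) - log Gamma(sum_s a_s)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))


def log_likelihood_alpha(alpha, counts_by_cell):
    """Log of the product over cells of D_4(alpha + n_jk) / D_4(alpha).

    alpha          : four positive numbers (alpha_1, ..., alpha_4)
    counts_by_cell : one 4-vector (n_1jk, n_2jk, n_3jk, n_4jk) per cell
    """
    if any(a <= 0 for a in alpha):
        raise ValueError("all components of alpha must be positive")
    base = log_dirichlet_function(alpha)
    return sum(
        log_dirichlet_function([a + n for a, n in zip(alpha, cell)]) - base
        for cell in counts_by_cell
    )


# Toy completed counts for two cells (made-up numbers for illustration).
toy_counts = [[60, 8, 30, 2], [25, 4, 12, 1]]
for alpha in ([1.0, 1.0, 1.0, 1.0], [6.0, 0.8, 3.0, 0.2], [60.0, 8.0, 30.0, 2.0]):
    print(alpha, round(log_likelihood_alpha(alpha, toy_counts), 3))

# Single observation in table 1 with a uniform prior: probability 1/4.
single_cell_value = log_likelihood_alpha([1.0, 1.0, 1.0, 1.0], [[1, 0, 0, 0]])
print(single_cell_value, log(0.25))
```

Summing the components of the maximizing `alpha` gives one value of
$\tau^{(h)}$ per imputed data set.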
Finally, using the method of moments, we fit
$\text{Gamma}(\alpha_0, \beta_0)$ to the “data,” $\tau^{(h)}$,
$h = 1, \ldots, M$, to get $\alpha_0 = a^2 / b$ and
$\beta_0 = a / b$, where $a = M^{-1} \sum_{h=1}^{M} \tau^{(h)}$ and
$b = (M - 1)^{-1} \sum_{h=1}^{M} (\tau^{(h)} - a)^2$. Thus, we have
constructed a data-dependent prior distribution for $\tau$. Our
procedure gives $\alpha_0 = 125$, $\beta_0 = 0.35$ (i.e., $\tau$ has
mean 357 and standard deviation 31.9). In section 4 we discuss
sensitivity to this choice.
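
The moment matching in the last step takes only a few lines. The
sketch below is illustrative: because the values $\tau^{(h)}$ are not
available here, a synthetic sample is drawn from Gamma(125, 0.35)
purely to exercise the function, and the sample size and seed are
arbitrary. Note that `random.gammavariate` takes a shape and a scale,
so the scale is $1/\beta_0$.

```python
import random
import statistics


def gamma_method_of_moments(sample):
    """Return (alpha_0, beta_0) matching the sample mean and variance.

    Uses alpha_0 = a^2 / b and beta_0 = a / b, where a is the sample
    mean and b the sample variance with divisor M - 1.
    """
    a = statistics.mean(sample)
    b = statistics.variance(sample)
    return a * a / b, a / b


def gamma_from_summary(mean, sd):
    """Shape and rate implied by a reported mean and standard deviation."""
    return (mean / sd) ** 2, mean / sd ** 2


# Synthetic stand-in for the tau values (not the authors' data).
rng = random.Random(2024)
synthetic_tau = [rng.gammavariate(125.0, 1.0 / 0.35) for _ in range(5000)]
alpha_hat, beta_hat = gamma_method_of_moments(synthetic_tau)
print(f"fitted alpha_0 = {alpha_hat:.1f}, beta_0 = {beta_hat:.3f}")

# The reported mean 357 and standard deviation 31.9 imply similar values.
alpha_summary, beta_summary = gamma_from_summary(357.0, 31.9)
print(f"from summary: alpha_0 = {alpha_summary:.1f}, beta_0 = {beta_summary:.3f}")
```

The second calculation confirms that a mean of 357 and a standard
deviation of 31.9 correspond to a shape near 125 and a rate near 0.35.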


4. Data and Empirical Analysis 


We apply our methodology to the data in the 3x3 
categorical table in Table 1. After we present results 
associated with the observed data and a sensitivity analysis, 
we describe a simulation study to assess the difference 
between the ignorable and the nonignorable nonresponse 
models. 


4.1 Data Analysis 


See Table 2 for a comparison of the ignorable nonresponse model and
the nonignorable nonresponse model. We have also included the
numerical standard error (NSE), which is a measure of how well the
numerical results can be reproduced; we have used the batch-means
method to compute it. Thus, one would be comfortable with small
NSE’s relative to the Monte Carlo estimates or the posterior means.
For both models the NSE’s are small, with relatively larger values
for the nonignorable nonresponse model (both near zero anyway),
indicating that the computations are repeatable. The posterior means
(PM) are very similar for the two models. The posterior standard
deviations (PSD) are larger for the nonignorable model, making the
95% credible intervals wider. Virtually all the 95% credible
intervals under the ignorable nonresponse model are contained in
those of the nonignorable nonresponse model.
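
For readers unfamiliar with the batch-means method, the following is
a minimal sketch of one common version: the chain is cut into
consecutive batches of equal length and the NSE of the overall mean is
the standard deviation of the batch means divided by the square root
of the number of batches. The number of batches is an assumption; the
paper does not state the value used.

```python
from math import sqrt
from statistics import mean, stdev


def batch_means_nse(chain, n_batches=25):
    """Numerical standard error of the chain mean by batch means.

    Trailing draws that do not fill a complete batch are discarded.
    """
    if n_batches < 2:
        raise ValueError("at least two batches are required")
    size = len(chain) // n_batches
    if size < 1:
        raise ValueError("chain is shorter than the number of batches")
    batch_means = [mean(chain[b * size:(b + 1) * size]) for b in range(n_batches)]
    return stdev(batch_means) / sqrt(n_batches)


# Tiny worked case: batches (1, 2) and (3, 4) have means 1.5 and 3.5.
tiny_nse = batch_means_nse([1, 2, 3, 4], n_batches=2)
print(tiny_nse)

# A chain that never moves has no Monte Carlo error.
flat_nse = batch_means_nse([5.0] * 100, n_batches=10)
print(flat_nse)
```

Because successive Gibbs draws are correlated, this estimate is
usually larger than the naive standard error that treats the draws as
independent.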


Table 2
Comparison of the Posterior Means (PM), Posterior Standard
Deviations (PSD), Numerical Standard Errors (NSE), and 95%
Credible Intervals (CI) for p from the Ignorable and Nonignorable
Nonresponse Models

Cell     $\hat{p}$    PM      PSD     NSE     CI
(a) Ignorable Model
(1,1)    0.337     0.330   0.005   0.001   (0.321, 0.339)
(1,2)    0.157     0.142   0.003   0.001   (0.136, 0.147)
(1,3)    0.154     0.168   0.004   0.001   (0.162, 0.175)
(2,1)    0.141     0.142   0.004   0.001   (0.134, 0.148)
(2,2)    0.071     0.066   0.002   0.001   (0.061, 0.070)
(2,3)    0.063     0.071   0.003   0.001   (0.066, 0.078)
(3,1)    0.050     0.053   0.003   0.001   (0.048, 0.059)
(3,2)    0.016     0.016   0.001   0.000   (0.013, 0.019)
(3,3)    0.010     0.012   0.002   0.000   (0.009, 0.015)

(b) Nonignorable Model
(1,1)    0.337     0.321   0.020   0.009   (0.278, 0.355)
(1,2)    0.157     0.143   0.008   0.003   (0.126, 0.158)
(1,3)    0.154     0.173   0.014   0.007   (0.140, 0.196)
(2,1)    0.141     0.139   0.019   0.009   (0.109, 0.182)
(2,2)    0.071     0.069   0.007   0.003   (0.056, 0.085)
(2,3)    0.063     0.071   0.013   0.006   (0.053, 0.102)
(3,1)    0.050     0.052   0.008   0.002   (0.040, 0.070)
(3,2)    0.016     0.019   0.003   0.001   (0.014, 0.026)
(3,3)    0.010     0.013   0.003   0.001   (0.009, 0.020)

Note: The ignorable nonresponse model has $\pi_{sjk} = \pi_s$,
s = 1, 2, 3, 4, j = 1, 2, 3, k = 1, 2, 3. The observed value of p
based on the complete data is $\hat{p}$.

In Table 3 we have also compared the estimation of $\pi_s$ in the
ignorable nonresponse model with $\pi_{sjk}$ in the nonignorable
nonresponse model. For the nonignorable nonresponse model we present
the range of the posterior means (PM) for the nine cells of each s,
s = 1, 2, 3, 4. This indicates the extent of the nonignorability.
The PM’s of $\pi_s$ are within the range of the $\pi_{sjk}$, and as
expected, the PSD’s are larger for the nonignorable model. For
example, over the nine cells the $\pi_{1jk}$ vary from 0.388 to
0.656, and these two numbers differ significantly from 0.615,
showing some degree of nonignorability. Thus, there is some
difference between the ignorable and the nonignorable nonresponse
models.

In Table 4 we have presented the logarithms of the Bayes factors for
testing the goodness of fit of the ignorable nonresponse model and
the nonignorable nonresponse model. There is “strong” evidence that
the ignorable nonresponse model fits better than the nonignorable
nonresponse model for these data (Kass and Raftery 1995). While the
ignorable nonresponse model provides “strong” evidence for no
association, the evidence from the nonignorable nonresponse model is
“positive” as stated by Kass and Raftery (1995). Thus, again there
is a difference between the ignorable and the nonignorable
nonresponse models. However, the NSE of 1.80 tends to nullify such
differences. Our conclusion is that there is strong evidence to
suggest no association between BMD and FI.
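
The verbal labels quoted from Kass and Raftery (1995) follow fixed
cut-offs. As a small illustration (a sketch with a hypothetical
function name), the code below applies their scale on the natural-log
Bayes factor, where the bands 0 to 1, 1 to 3, 3 to 5 and above 5
correspond to 0 to 2, 2 to 6, 6 to 10 and above 10 on the scale of
twice the log Bayes factor.

```python
def kass_raftery_label(log_bayes_factor):
    """Evidence category for a natural-log Bayes factor (sign ignored)."""
    strength = abs(log_bayes_factor)
    if strength < 1:
        return "not worth more than a bare mention"
    if strength < 3:
        return "positive"
    if strength < 5:
        return "strong"
    return "very strong"


# Differences reported in Table 4 (association minus no association).
for model, difference in (("ignorable", -3.398), ("nonignorable", -2.996)):
    print(f"{model}: |log BF| = {abs(difference):.3f} -> "
          f"{kass_raftery_label(difference)}")
```

The nonignorable value sits just below the boundary between
"positive" and "strong", which is why the text treats the two models
as giving similar conclusions once the NSE is taken into account.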


Table 3
Comparison of the Posterior Means (PM) and Posterior Standard
Deviations (PSD) for $\pi_{sjk}$ from the Ignorable and Nonignorable
Nonresponse Models

           Ignorable        Nonignorable
$\pi_1$    0.615 (0.009)    0.388 (0.078) to 0.656 (0.044)
$\pi_2$    0.077 (0.005)    0.057 (0.017) to 0.195 (0.068)
$\pi_3$    0.292 (0.008)    0.217 (0.041) to 0.349 (0.053)
$\pi_4$    0.015 (0.002)    0.013 (0.005) to 0.152 (0.055)

Note: PSD’s are in parentheses. For the ignorable nonresponse
model the parameters are $\pi_1, \pi_2, \pi_3$ and $\pi_4$, and for
the nonignorable nonresponse model the parameters are $\pi_{sjk}$,
s = 1, 2, 3, 4; j = 1, 2, 3; k = 1, 2, 3. Among the nine cells for
each s we selected the smallest PM and the largest PM to form the
range.


Table 4
Marginal Likelihoods and Bayes Factors for Testing Association
Between BMD and FI Under the Ignorable and the Nonignorable
Nonresponse Models

                Association   No association   Difference
Ignorable           -49.571          -46.173       -3.398
Nonignorable        -53.129          -50.132       -2.996
NSE                   1.800            1.790

Note: All entries (marginal likelihoods and their differences) are on
the logarithmic scale. The Monte Carlo integration uses 50,000
iterations. The NSEs, numerical standard errors, are small relative
to the marginal likelihoods.


We have considered the relation between BMD and FI when the
osteopenia and osteoporosis levels are collapsed into one level.
Under the ignorable nonresponse model the log Bayes factor is -2.77
(log marginal likelihoods: -32.82 and -29.05), and under the
nonignorable nonresponse model the log Bayes factor is -4.52 (log
marginal likelihoods: -34.25 and -4.52). Thus, the same conclusion
is reached about no association between BMD and FI.

We have also separated out the data into two age groups:
premenopausal (age at most 49 years old; young) and postmenopausal
(age at least 50 years old; old). For the young group there were
only 4 females with osteoporosis, and so we collapsed the females
with osteopenia and osteoporosis. We fit both the ignorable and
nonignorable nonresponse models to these data and got similar
results. For the old group using the ignorable nonresponse model the
log marginal likelihoods corresponding to no association and
association are -43.01 and -38.91, giving a log Bayes factor of 4.10
for no association. Thus, there is strong evidence for no
association between BMD and FI. For the young group using the
ignorable nonresponse model the log marginal likelihoods
corresponding to no association and association are -29.93 and
-28.80, giving a log Bayes factor of 1.13 for no association. Thus,
there is positive evidence for no association between BMD and FI for
both age groups. Therefore, age is unlikely to play a role in the
association of BMD and FI.


4.2 Sensitivity Analysis 


We have studied the sensitivity of inference about the $p_{jk}$ with
respect to the prior distribution of $\tau$. That is, we have taken
$\tau \sim \text{Gamma}(\kappa\alpha_0, \beta_0)$, where $\kappa$ is
a sensitivity parameter that we have taken to be 1 in our analysis
(note that $E(\tau) = \kappa\alpha_0 / \beta_0$).

Our procedure for the specification of $\alpha_0$ and $\beta_0$
gives values of $\alpha_0 = 125$ and $\beta_0 = 0.35$; see section
3.5. Making $\kappa$ bigger than 1 induces smaller changes in the
posterior mean (PM) and posterior standard deviation (PSD) of the
$p_{jk}$ than making $\kappa$ smaller than 1, because larger values
of $\kappa$ induce much smaller changes in the prior distribution of
$\tau$. In Table 5 we present PM’s and PSD’s of the $p_{jk}$ for
$\kappa = 0.25, 0.50, 1.00, 2.00, 4.00$. The PM’s increase with
$\kappa$ and the PSD’s decrease as $\kappa$ increases from 0.25 to
4.00. Thus, there is some sensitivity to the specification of
$\alpha_0$ and $\beta_0$, but the changes are small. For example,
the PM’s of $p_{11}$ are 0.31, 0.32, 0.33 at
$\kappa = 0.25, 1.00, 4.00$ and the PSD’s at these values of
$\kappa$ are 0.04, 0.02, 0.01.
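
How the sensitivity parameter reshapes the prior on $\tau$ follows
directly from the gamma moments. The sketch below (illustrative; the
function name is hypothetical) tabulates the prior mean
$\kappa\alpha_0/\beta_0$, the prior standard deviation
$\sqrt{\kappa\alpha_0}/\beta_0$ and the coefficient of variation
$1/\sqrt{\kappa\alpha_0}$ for the values of $\kappa$ used in Table 5,
with $\alpha_0 = 125$ and $\beta_0 = 0.35$ taken from section 3.5.

```python
from math import sqrt

ALPHA0 = 125.0   # shape specified in section 3.5
BETA0 = 0.35     # rate specified in section 3.5


def gamma_prior_summary(kappa, alpha0=ALPHA0, beta0=BETA0):
    """Mean, standard deviation and coefficient of variation of
    Gamma(kappa * alpha0, beta0) with beta0 a rate parameter."""
    shape = kappa * alpha0
    prior_mean = shape / beta0
    prior_sd = sqrt(shape) / beta0
    return prior_mean, prior_sd, prior_sd / prior_mean


summaries = {kappa: gamma_prior_summary(kappa)
             for kappa in (0.25, 0.50, 1.00, 2.00, 4.00)}
for kappa, (m, s, cv) in summaries.items():
    print(f"kappa = {kappa:4.2f}: mean = {m:7.1f}, sd = {s:5.1f}, cv = {cv:.3f}")
```

Small values of $\kappa$ pull the prior mean of $\tau$ down and so
allow the $\pi_{jk}$ to differ more across cells, which is consistent
with the larger posterior standard deviations at $\kappa = 0.25$.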


Table 5
Sensitivity of the Posterior Means (PM) and Posterior Standard Deviations (PSD) of the $p_{jk}$ to Choices of $\kappa$
in the Nonignorable Nonresponse Model


K 0.25 0.50 1.00 2.00 4.00 

Cell PM PSD PM PSD PSD PM PSD PM PSD 

GED) wi306.93. yy 3e09 315.01 25.81 321.81 19,95 325,31 14.55 326.16 10.46 
(1,2) 141.12 15,52 139.86 11.91 142.66 8.44 142.63 6.68 143.42 5.01 
(1,3)  dAGEL6S ~ 25.80 167.83 18.77 173.40 13.74 176.20 8.44 175.78 6.71 
(2,1) 143.18 34.20 142.62 24.92 138.57 18.82 PShzZ3 13.59 £S7°20 9.70 
(2,2) 68.46 {3.12 71.06 10.09 68.44 7.48 68.79 Die 68.11 4.45 
(2,3). yA GTS UZ 283 75.97 17.86 TiAl 12.56 68.09 7.84 68.34 6.38 
Dots 29.97 ou 53.50 1a 9a 52.14 7.76 50.97 ae 51.41 4.35 
Bs2y ™ 2043 7.76 20.02 4.89 18.96 23.28 18.67 2.78 17.84 aS 
(3, 3) 17.45 10.38 14.12 4.28 12:93 299 12.05 2.34 11.69 £99 


Note: All entries must be multiplied by $10^{-3}$. In the nonignorable nonresponse model $\tau \sim \text{Gamma}(\kappa\alpha_0, \beta_0)$,
where $\kappa$ is the sensitivity parameter and $\alpha_0 = 125$ and $\beta_0 = 0.35$.
We have also studied the sensitivity of the Bayes factors to choices
of $\kappa$ (see Table 6). First, the NSE’s decrease with $\kappa$,
but the change is small. Note that we have used 50,000 iterations in
the Monte Carlo integration; this sample size is needed for the
Monte Carlo estimates to stabilize. The log marginal likelihoods do
not change too much with $\kappa$. Because the log Bayes factors are
small, some changes are reflected in inference: at
$\kappa = 0.25, 0.50, 4.00$ there is “strong” evidence for no
association, but at $\kappa = 1.00, 2.00$ there is “positive”
(borderline) evidence for no association. Overall, there is some
degree of evidence for no association. Thus, it is interesting that
one does not need to worry too much about the choice for
$(\alpha_0, \beta_0)$.


Table 6
Sensitivity of the Marginal Likelihoods and the Bayes Factor to
Choices of $\kappa$ in the Nonignorable Nonresponse Model

             Association       No Association
$\kappa$       ML      NSE       ML      NSE      Bayes Factor
0.25     -53.37   1.90    -49.16   1.89       -4.21
0.50     -52.58   1.83    -49.49   1.82       -3.08
1.00     -52.58   1.80    -49.76   1.79       -2.82
2.00     -52.81   1.79    -49.83   1.78       -2.98
4.00     -52.95   1.78    -49.91    ?         -3.04

Note: All entries are on the logarithm scale. In the nonignorable nonresponse
model $\tau \sim \text{Gamma}(\kappa\alpha_0, \beta_0)$, where $\kappa$ is the
sensitivity parameter and $\alpha_0 = 125$ and $\beta_0 = 0.35$.
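
The grading of the evidence in Table 6 can be sketched in a few lines of Python. This is an illustration only: the function name `evidence_label` is hypothetical, and the cut-offs of 1, 3 and 5 on the natural-log scale are an assumption (they reproduce the "positive" and "strong" labels quoted in the text, but the section does not state the cut-offs explicitly).

```python
def evidence_label(log_bf: float) -> str:
    """Grade |log Bayes factor| on the natural-log scale (assumed cut-offs 1, 3, 5)."""
    size = abs(log_bf)
    if size < 1.0:
        return "weak"
    if size < 3.0:
        return "positive"
    if size < 5.0:
        return "strong"
    return "very strong"


# Log Bayes factors (association versus no association) from Table 6, keyed by kappa.
log_bayes_factors = {0.25: -4.21, 0.50: -3.08, 1.00: -2.82, 2.00: -2.98, 4.00: -3.04}

# Negative values favour the no-association model; the label grades how strongly.
labels = {kappa: evidence_label(bf) for kappa, bf in log_bayes_factors.items()}
```

Under these assumed cut-offs the values near 3 sit close to the boundary, which is why small changes in $\kappa$ move the label between "positive" and "strong".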


4.3 Simulation Study 


We have performed a simulation study to further
compare the ignorable and nonignorable nonresponse
models. Our objective is to confirm differences that exist
between the two models. In our situation a test based on the
Bayes factor can confirm one or the other. With limited
information about nonignorability (our current situation), it
is sensible to fit an ignorable nonresponse model because all
the parameters are identifiable in the ignorable nonresponse
model. Thus, we proceed by comparing the ignorable and
nonignorable nonresponse models when data are generated
from (a) the ignorable nonresponse model and (b) the
nonignorable nonresponse model. This is a typical Bayesian
analysis.

We obtained the posterior means of the $p_{jk}$ and the
$\pi_{sjk}$, denoted by $\hat{p}_{jk}$ and $\hat{\pi}_{sjk}$ respectively, after the
nonignorable nonresponse model is fit to the observed data. For
the ignorable model we took $\hat{\pi}_s = \sum_{j=1}^{r}\sum_{k=1}^{c}\hat{\pi}_{sjk}/rc$,
$s = 1, 2, 3, 4$. We obtained the cell counts for the
ignorable model by drawing from

$$(y_{111}, \ldots, y_{1rc}, \ldots, y_{411}, \ldots, y_{4rc})' \sim \text{Multinomial}\{n, (\hat{\pi}_1\hat{p}_{11}, \ldots, \hat{\pi}_4\hat{p}_{rc})\}$$

and for the nonignorable model by drawing from

$$(y_{111}, \ldots, y_{1rc}, \ldots, y_{411}, \ldots, y_{4rc})' \sim \text{Multinomial}\{n, (\hat{\pi}_{111}\hat{p}_{11}, \ldots, \hat{\pi}_{4rc}\hat{p}_{rc})\},$$

where $n = 2{,}998$, the total number of individuals in the
original data set (see Table 1). We have generated 1,000
datasets from each of the ignorable and nonignorable
nonresponse models. Then, we fit the ignorable and
nonignorable nonresponse models to each dataset in exactly the
same manner as for the observed data in Table 1, and we
computed the posterior means (PM) and the posterior
standard deviations (PSD) for the $p_{jk}$. In Table 7 we
present the averages of the PM's and PSD's over the 1,000
datasets. The second column (labeled $\hat{p}$) has the posterior
mean of $p_{jk}$ for the observed data under the nonignorable
nonresponse model (see Table 2b).
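
The data-generation step just described can be sketched as follows. This is a minimal illustration under stated assumptions: the arrays `p_hat` and `pi_hat` are random stand-ins for the posterior means (they are not the paper's estimates), the function name `simulate` is hypothetical, and the four tables are indexed by `s = 0, ..., 3`.

```python
import numpy as np

rng = np.random.default_rng(2005)

r, c, n_tables = 3, 3, 4        # 3 x 3 table, four tables (s = 1, ..., 4)
n = 2998                        # total number of individuals, as in the text

# Hypothetical stand-ins for the posterior means (not the paper's estimates).
p_hat = rng.dirichlet(np.ones(r * c)).reshape(r, c)        # cell probabilities p_jk
pi_hat = rng.dirichlet(np.ones(n_tables), size=(r, c))     # shape (r, c, 4)
pi_hat = np.moveaxis(pi_hat, -1, 0)                        # shape (4, r, c): pi_sjk

# Ignorable model: pi_s is the average of pi_sjk over the rc cells.
pi_bar = pi_hat.mean(axis=(1, 2))                          # shape (4,)
prob_ignorable = pi_bar[:, None, None] * p_hat[None, :, :]

# Nonignorable model: the cell-specific pi_sjk are kept.
prob_nonignorable = pi_hat * p_hat[None, :, :]


def simulate(prob, n_datasets):
    """Draw n_datasets multinomial tables of shape (4, r, c), each with n individuals."""
    draws = rng.multinomial(n, prob.ravel(), size=n_datasets)
    return draws.reshape(n_datasets, *prob.shape)


y_ignorable = simulate(prob_ignorable, 1000)
y_nonignorable = simulate(prob_nonignorable, 1000)
```

In an actual study, each simulated table would then be reduced to its observed margins for the incomplete tables before either model is fitted; that step is omitted here.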

For (a) in Table 7 the PM's are very close to the $\hat{p}_{jk}$ for
the ignorable nonresponse model, but not so close when the
nonignorable nonresponse model is fit. It is noticeable that


Table 7
Comparison of the Ignorable and Nonignorable Nonresponse Models Via the Simulated Data and the Posterior Means (PM)
and Posterior Standard Deviations (PSD) of the $p_{jk}$

Simulated                       Ignorable (a)                        Nonignorable (b)
Fitted                   Ignorable       Nonignorable         Ignorable       Nonignorable
Cell    $\hat{p}$        PM      PSD      PM      PSD          PM      PSD      PM      PSD
(1,1)   321.81     320.73   6.72    307.42  11.30      332.02   5.10    324.44  10.60
(1,2)   142.66     142.96   4.24    146.44   7.34      141.81   3.30    143.44   5.43
(1,3)   173.40     172.59   4.42    173.49   7.62      168.66   4.14    174.10   7.04
(2,1)   138.57     138.82   4.81    135.52   9.82      143.63   4.52    139.20   9.74
(2,2)    68.44      68.44   3.55     72.01   6.02       64.51    ?       68.20   4.76
(2,3)    71.11      71.41   3.65     75.00   6.30       70.85   3.76     69.63   6.58
(3,1)    52.14        ?     3.11     53.03   4.95       53.08   3.04     52.44   4.70
(3,2)    18.96        ?     2.08     21.65   2.98       15.08    ?         ?     2.48
(3,3)    12.93      13.54   1.78     15.64   2.55       10.95   1.85     11.20   2.18

Note: Data are simulated from the ignorable nonresponse model in (a) or the nonignorable nonresponse model in (b), and both
the ignorable and nonignorable nonresponse models are fit. We have generated 1,000 datasets, and we fit both the
ignorable and nonignorable nonresponse models to each simulated dataset. The PM's and PSD's are averages over the
1,000 datasets and $\hat{p}$ is the posterior mean for the observed data which we used to generate the data sets. All entries must
be multiplied by $10^{-3}$.
the PSD's under the nonignorable nonresponse model are
about twice as large as those under the ignorable
nonresponse model. For (b) in Table 7 the PM's for the
nonignorable nonresponse model are closer to the $\hat{p}_{jk}$ than
those from the ignorable nonresponse model. However, in
both cases the PSD's for the nonignorable nonresponse
model are about twice those from the ignorable nonresponse
model. For example, for the (1, 1) cell in Table 7,
compared with 0.322 for $\hat{p}_{11}$, in (a) the ignorable
(nonignorable) model gives a PM of 0.321 (0.307), but in (b) the
ignorable (nonignorable) model gives a PM of 0.332 (0.324);
see Table 7 for other examples. Thus, the two models are indeed
different for estimating p.

We have also considered estimating the proportion P of
simulated datasets in which the nonignorable nonresponse
model performs better than the ignorable nonresponse
model. It is expensive to compute the marginal likelihood
under the nonignorable nonresponse model. We note again
that it takes 50,000 iterations for the Monte Carlo estimate
to stabilize; this is an enormous task for the simulation study
because we need to calculate the marginal likelihoods for
1,000 datasets. Thus, we use a simple procedure to compare
the two models, and we expect that this procedure would
give a conclusion similar to a power calculation.

Specifically, we compute a discrepancy $\Delta^{(h)}$ that sums, over the
cells $j = 1, \ldots, r$, $k = 1, \ldots, c$, the deviations of
$\text{PM}_{jk}^{(h)}$ from $\hat{p}_{jk}$, where $\text{PM}_{jk}^{(h)}$ is the posterior mean of
$p_{jk}$ corresponding to the $h$th dataset. We denote $\Delta^{(h)}$ by
$\Delta_{ig}^{(h)}$ for the ignorable nonresponse model and $\Delta_{nig}^{(h)}$ for the
nonignorable nonresponse model. An estimator of P, $\hat{P}$, is
obtained as the proportion of the 1,000 experiments
in which $\Delta_{ig}^{(h)} > \Delta_{nig}^{(h)}$. For the data generated from the
ignorable nonresponse model, $\hat{P}$ is 0.236 with a standard
error of 0.013. For the data generated from the nonignorable
nonresponse model, $\hat{P}$ is 0.920 with a standard error of
0.009. Thus, if the ignorable nonresponse model is expected
to hold, about 24% of the time the nonignorable
nonresponse model will beat it, and if the nonignorable
nonresponse model is expected to hold, only about
$(1 - 0.920)100\% = 8\%$ of the time the ignorable nonresponse
model will beat it. Thus, there are latent differences between
these two models. The nonignorable nonresponse model
does capture some degree of nonignorability, and it
robustifies the ignorable nonresponse model. We believe
that this is a reasonable comparison between the ignorable
and the nonignorable nonresponse models.
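
As a rough check on the figures above, the reported standard errors agree with the usual binomial formula $\sqrt{\hat{P}(1-\hat{P})/1000}$. The sketch below assumes that this is how they were obtained (the text does not say so), and the function names are hypothetical.

```python
import math


def proportion_exceeding(delta_ig, delta_nig):
    """Share of datasets in which the ignorable discrepancy exceeds the nonignorable one."""
    wins = sum(1 for d_ig, d_nig in zip(delta_ig, delta_nig) if d_ig > d_nig)
    return wins / len(delta_ig)


def binomial_se(p_hat, n_sets):
    """Binomial standard error of an estimated proportion (assumed formula)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sets)


# Reported proportions, each based on 1,000 simulated datasets.
se_ignorable_data = binomial_se(0.236, 1000)
se_nonignorable_data = binomial_se(0.920, 1000)

# Tiny made-up example of the counting step (three datasets).
toy_p = proportion_exceeding([0.4, 0.1, 0.3], [0.2, 0.2, 0.1])
```

Rounded to three decimals, the two standard errors come out as 0.013 and 0.009, matching the values quoted in the text.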


5. Concluding Remarks 


There are two key methodological developments in this
paper. Specifically, we have shown that (a) it is possible to
analyze multinomial data from $r \times c$ categorical tables
when there are both item and unit nonresponses, and the
nonresponse mechanism may be nonignorable; and (b) by
using the Bayes factor (ratio of the marginal likelihoods of
two models), we can test for association between the two
categories. Essentially, we have assumed that there is no
information about nonignorability, all design features are
suppressed and we have taken a conservative ground.

For the $3 \times 3$ categorical data of BMD and FI, we have
shown how to estimate the cell probabilities accurately. For
the complete cases, the Bayes factor shows “strong”
evidence for no association between BMD and FI. For all
the data, our Bayes factor shows that the evidence for no
association is “strong” under the ignorable nonresponse
model, and is “positive” under the nonignorable
nonresponse model. Thus, there is virtually no difference
between the two scenarios, namely using only the complete
cases and using all the data. Also, based on the
Bayes factor and our simulation study, while there are
differences between the ignorable nonresponse model and
the nonignorable nonresponse model, such differences are
small. There are differences for inference about the
proportions of individuals in various BMD-FI levels; the
posterior means are similar but the posterior standard
deviations under the nonignorable nonresponse model are
larger than those under the ignorable nonresponse model.

Our simulation study supports two properties (subtle 
differences) of our models. First, the estimates of the cell 
probabilities from the ignorable (nonignorable) nonresponse 
model are closer to the true values when the ignorable 
(nonignorable) nonresponse model is expected to hold, but 
in either case the estimates from the nonignorable non- 
response model have about twice the standard deviations 
from the ignorable nonresponse model. Second, if the 
ignorable (nonignorable) nonresponse model is expected to 
hold, it can be beaten by the nonignorable (ignorable) non- 
response model. This happens a significantly larger pro- 
portion of time when the ignorable nonresponse model is 
expected to hold. Thus, there are differences between these 
models. We suggest fitting both models and computing the
Bayes factor to decide which one to use. We do not
recommend using these models when there are appropriate
covariates and/or prior information to explain
nonignorability.

In future research one can attempt to reduce the number
of parameters in the nonignorable nonresponse model to
further reduce the effects of nonignorability. For example, it
may be possible to consider representing the data in two
categorical tables as follows. The three supplemental tables
are collapsed into a single supplemental table with its $j$th
row having at least $u_j$ individuals, and its $k$th column
having at least $v_k$ individuals; the total number of
individuals in this supplemental table is $w + \sum_{j=1}^{r} u_j + \sum_{k=1}^{c} v_k$;
see section 3.1 for notation. Finally, we note that a full
analysis of data from a complex survey requires an input of
information (covariates and prior information) about
nonignorability, sampling weights and clustering effects as well.


Appendix A 
Fitting the Nonignorable Nonresponse Model 


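We show how to use the Gibbs sampler to make
inference about the parameters in (14). The conditional
posterior density of p is

$$p \mid y \sim \text{Dirichlet}\Big(\sum_{s=1}^{4} y_{s11} + 1, \ldots, \sum_{s=1}^{4} y_{src} + 1\Big) \tag{A.1}$$

and the conditional posterior density of $\pi_{jk}$ is

$$\pi_{jk} \mid \mu, \tau, y \sim \text{Dirichlet}(y_{1jk} + \mu_1\tau,\; y_{2jk} + \mu_2\tau,\; y_{3jk} + \mu_3\tau,\; y_{4jk} + \mu_4\tau) \tag{A.2}$$

with independence over $j = 1, \ldots, r$, $k = 1, \ldots, c$.

We need the conditional posterior probability mass
functions of $y_s$, $s = 2, 3, 4$ given $y_1$, p, $\pi_{jk}$, $j = 1, \ldots, r$,
$k = 1, \ldots, c$. From (14) it is clear that the $y_s$, $s = 2, 3, 4$ are
conditionally independent multinomial random vectors.
Specifically,

$$(y_{2j1}, \ldots, y_{2jc}) \mid u_j, p, \pi \sim \text{Multinomial}(u_j, q_j^{(2)}), \quad j = 1, \ldots, r,$$

$$(y_{31k}, \ldots, y_{3rk}) \mid v_k, p, \pi \sim \text{Multinomial}(v_k, q_k^{(3)}), \quad k = 1, \ldots, c,$$

$$(y_{411}, \ldots, y_{4rc}) \mid w, p, \pi \sim \text{Multinomial}(w, q^{(4)}), \tag{A.3}$$

where $q_{jk}^{(2)} = \pi_{2jk}p_{jk} / \sum_{k'=1}^{c}\pi_{2jk'}p_{jk'}$, $k = 1, \ldots, c$,
$q_{jk}^{(3)} = \pi_{3jk}p_{jk} / \sum_{j'=1}^{r}\pi_{3j'k}p_{j'k}$, $j = 1, \ldots, r$, and
$q_{jk}^{(4)} = \pi_{4jk}p_{jk} / \sum_{j'=1}^{r}\sum_{k'=1}^{c}\pi_{4j'k'}p_{j'k'}$, $j = 1, \ldots, r$, $k = 1, \ldots, c$.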

Next, we consider the hyper-parameters. Letting
$\delta_s = \prod_{j=1}^{r}\prod_{k=1}^{c}\pi_{sjk}$, the joint conditional posterior density of
$\mu, \tau$ is


p(y, tTit,, 7 =1,... 


- {t a} oq)" fe A) 
si 


where Si2y ron SO ese 1; SpetvessO: 

We use the grid method to get samples from the
conditional posterior densities $p(\mu \mid \tau, \pi_{jk}, j = 1, \ldots, r,
k = 1, \ldots, c)$ and $p(\tau \mid \mu, \pi_{jk}, j = 1, \ldots, r, k = 1, \ldots, c)$.
By transforming $\tau$ to $\phi/(1 - \phi)$, the parameters now live
on (0, 1) with appropriate constraints, making the grid
procedure convenient. We use 50 intervals of equal widths
(obtained by experimentation) to draw $\mu$ and $\phi$, and a
random deviate for $\tau$ is $\phi/(1 - \phi)$.

The Gibbs sampler is executed by drawing a random
deviate from each of the conditional posterior “densities”,
(A.1), (A.2), (A.3), and (A.4) in turn, iterating the entire
procedure until convergence. This is an example of the
griddy Gibbs sampler (Ritter and Tanner 1992).
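
The grid draw on (0, 1) can be sketched as below. This is a generic illustration rather than the authors' code: the target used for $\tau$ is a hypothetical Gamma density with shape 2 and rate 1 standing in for the conditional posterior in (A.4), and the names `griddy_draw` and `log_density_phi` are assumptions.

```python
import numpy as np


def griddy_draw(log_density, rng, n_intervals=50):
    """Draw one value on (0, 1) from an unnormalised log density using an equal-width grid."""
    edges = np.linspace(0.0, 1.0, n_intervals + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    log_w = np.array([log_density(m) for m in mids])
    w = np.exp(log_w - log_w.max())                 # stabilise before normalising
    interval = rng.choice(n_intervals, p=w / w.sum())
    return rng.uniform(edges[interval], edges[interval + 1])


def log_density_phi(phi, shape=2.0, rate=1.0):
    """Log density of phi = tau / (1 + tau) when tau has a Gamma(shape, rate) density."""
    tau = phi / (1.0 - phi)
    # The term -2 log(1 - phi) is the log Jacobian of tau = phi / (1 - phi).
    return (shape - 1.0) * np.log(tau) - rate * tau - 2.0 * np.log(1.0 - phi)


rng = np.random.default_rng(1992)
phi_draws = np.array([griddy_draw(log_density_phi, rng) for _ in range(2000)])
tau_draws = phi_draws / (1.0 - phi_draws)           # back-transform to tau
```

Working on the unit interval means the same fixed grid of 50 intervals can be reused at every iteration, whatever the scale of $\tau$.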


Appendix B 
Estimation of p,,,(y,) in (16) 


Letting n,, denote the number of incomplete cases (i.e., 
n=Nny)+n,,), one can also show that for the model with 
association Pyg(y,)=a((n+1)!)/(ny!n,,!)A and for the 
model with no association Pyicg(y,)=b((n+1)!/(ny!n,,!))B, 
where a and b are given in (18), 


A= 
ie Vi jk 

{ {Ty te I]. 2% 

4 jk s=2 j,k DC yay Alen. a ¥ peeckh) 

Il, renee 0% Bot 
AU iesaee galn wecberc tg 
WYRE D(wT) “Tad 
Be 


ie {Tis HS 


Nm 
Randi | 
s=2) Jak 
yt 
I 97,’ 


Yi jk 
¥ ik: q 2k 
DV ag, hn caat Varist, TAL Giat tle eee 


Tie Oo qc %on! 6 Bot 
TL tae Pniniaia aie eioe (B.1) 
jx | D(pTt) T(Q, ) 


Note that 0< A, B<1 gives a useful diagnostic check on 
the computation. 

We show how to compute A in (B.1) using Monte Carlo 
integration; the procedure to compute B is similar. We 
prefer the simpler procedure based on Monte Carlo 
integration with an importance function (Nandram and Kim 
2002) rather than the method based on a continuation of the 
Gibbs sampler (Chib and Jeliazkov 2001). 

For A we choose the importance function 


I, Pe L og | 


Dy, +h... Vie +) Jk D(wT) 


4a has 
Tie pee 
Diz) Ta) 


Tim (Q, ) = 


where fi, and T are estimates obtained using a Gibbs 
output. We obtain a sample from 7,,,(Q,) by drawing 
t~ Gamma(Q,, B,), B + Pirichlet(jt, 1), a gle, t ~ 
Dirichlet(u,t) and pl y, ~ Dirichlet(y,,,+1,..., y,,. +). 
Then, letting w,=D5 Lk Nix log min +n, logldt. D5 
Di THe PP 1- Dé (LF 1) + logy + log (Di), h= 
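1,...,M, an estimator of A is $\hat{A} = M^{-1}\sum_{h=1}^{M} e^{\omega_h}$. The
numerical standard error (NSE) of $\log(\hat{A})$ can be approximated
as follows. Letting $\bar{\omega} = M^{-1}\sum_{h=1}^{M}\omega_h$ and
$S^2 = (M - 1)^{-1}\sum_{h=1}^{M}(\omega_h - \bar{\omega})^2$, we have
$\text{Var}(\hat{A}) \approx e^{2\bar{\omega}}S^2/M$,
$\text{Var}(\log(\hat{A})) \approx \text{Var}(\hat{A})/e^{2\bar{\omega}} \approx S^2/M$, and the NSE is
approximately $S/\sqrt{M}$. We start with M = 10,000
independent samples from the importance function and
increase M until convergence, which occurs at about
M = 50,000.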
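
The estimator and its approximate NSE can be computed from the log-weights $\omega_h$ as sketched below. The log-weights here are simulated from a normal distribution purely for illustration (they are not the paper's), and the function name is hypothetical; the sum is taken with a log-sum-exp shift because values such as $e^{-50}$ are otherwise prone to underflow in longer products.

```python
import numpy as np


def log_estimate_and_nse(omega):
    """Return log(A_hat) and the approximate NSE S / sqrt(M) from log-weights omega_h."""
    omega = np.asarray(omega, dtype=float)
    m = omega.size
    shift = omega.max()                                   # log-sum-exp shift for stability
    log_a_hat = shift + np.log(np.exp(omega - shift).sum()) - np.log(m)
    s = omega.std(ddof=1)                                 # S, computed with divisor M - 1
    return log_a_hat, s / np.sqrt(m)


rng = np.random.default_rng(0)
omega = rng.normal(loc=-50.0, scale=0.4, size=50_000)     # hypothetical log-weights
log_a_hat, nse = log_estimate_and_nse(omega)
```

Monitoring `log_a_hat` and `nse` while M is increased gives a simple way to judge when the estimate has stabilised.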
On the Use of Data Collection Process Information for the Treatment of Unit Nonresponse Through Weight Adjustment


Jean-François Beaumont ¹


Abstract 


Nonresponse weight adjustment is commonly used to compensate for unit nonresponse in surveys. Often, a nonresponse 
model is postulated and design weights are adjusted by the inverse of estimated response probabilities. Typical nonresponse 
models are conditional on a vector of fixed auxiliary variables that are observed for every sample unit, such as variables 
used to construct the sampling design. In this note, we consider using data collection process variables as potential auxiliary 
variables. An example is the number of attempts to contact a sample unit. In our treatment, these auxiliary variables are 
taken to be random, even after conditioning on the selected sample, since they could change if the data collection process 
were repeated for a given sample. We show that this randomness introduces no bias and no additional variance component 
in the estimates of population totals when the nonresponse model is properly specified. Moreover, when nonresponse 
depends on the variables of interest, we argue that the use of data collection process variables is likely to reduce the 
nonresponse bias if they provide information about the variables of interest not already included in the nonresponse model 
and if they are associated with nonresponse. As a result, data collection process variables may well be beneficial to handle 
unit nonresponse. This is briefly illustrated using the Canadian Labour Force Survey. 


Key Words: Nonresponse bias; Nonresponse model; Nonresponse variance; Number of attempts; Paradata; Response probability.


1. Introduction 


Unit nonresponse is often handled in surveys by using a 
nonresponse weight adjustment method. The basic principle 
that is often chosen is to adjust the design weights by the 
inverse of estimated response probabilities (see, for exam- 
ple, Ekholm and Laaksonen 1991). These estimated re- 
sponse probabilities are obtained by postulating a model for 
the unknown nonresponse mechanism, which we call the 
nonresponse model. Key to reducing the nonresponse bias 
and variance as much as possible is to condition on a vector 
of auxiliary variables that are observed for every sample 
unit and that are good predictors of both nonresponse and 
the variables of interest (Little and Vartivarian 2005). 
Usually, the auxiliary variables are treated as being fixed 
both unconditionally and conditionally on the selected 
sample. 

In this note, we consider using Data Collection Process
(DCP) variables as potential auxiliary variables to be
included in the nonresponse model. An example is the
number of attempts to contact a sample unit. This type of
data is sometimes called paradata (see Couper and Lyberg
2005 for a recent reference on paradata) and has been used
to deal with unit nonresponse by Holt and Elliott (1991),
among others. In our treatment, contrary to Holt and Elliott
(1991), DCP variables are taken to be random, even after
conditioning on the selected sample, since they could
change if the data collection process were repeated for a
given sample.

DCP variables may be particularly useful in
cross-sectional surveys where the auxiliary variables available to
handle unit nonresponse are often limited to variables used
to construct the sampling design. Although such design
variables are not useless, they are often very good
predictors neither of nonresponse nor of the variables of interest. The
additional information from the data collection process may be
welcome in these cases. In longitudinal surveys, there is a
wealth of potential auxiliary variables to deal with wave
nonresponse. DCP information may thus not have the same
importance to compensate for wave nonresponse as
it has to compensate for unit nonresponse in
cross-sectional surveys. However, we have yet to study this
in any depth. It may turn out that, at change points, DCP
variables matter greatly.

In section 2, we introduce notation and theory concerning 
the effect of using random auxiliary variables in the 
nonresponse model when estimating population totals. This 
issue of the randomness of DCP auxiliary variables was 
raised and debated at Statistics Canada’s Advisory Commit- 
tee on Statistical Methods after the paper by Alavi and 
Beaumont (2004) was presented. The goal of section 2 is 
thus to shed some light on this issue. The use of DCP 
variables to adjust design weights for nonresponse is briefly 
illustrated in section 3, using the Canadian Labour Force
Survey (CLFS). The last section, section 4, contains a brief
summary of the paper.


2. Theory 


Let us assume that we are interested in estimating the
population total $t_y = \sum_{k \in U} y_k$ of a variable of interest y
for a certain fixed population U of size N. From this
population, a random sample s of size n is selected according
to a probability sampling design $p(s \mid D)$, where D is an
N-row matrix containing $d_k'$ in its $k$th row and d is the
vector of design variables. Let us also assume that, in the
absence of nonresponse, we would use the Horvitz-Thompson
estimator $\hat{t}_y = \sum_{k \in s} w_k y_k$, where $w_k = 1/\pi_k$ is the design
weight of unit k and $\pi_k = P(k \in s)$ is its selection
probability.

Usually, due to a number of reasons, unit nonresponse
occurs so that the variable y is only observed for a subset $s_r$
of s, the respondents. Along with $s_r$, a random vector z
of DCP variables is also observed for every sample unit
according to a joint mechanism $\phi q(Z_s, s_r \mid s, Y, D, X)$.
As mentioned in the introduction, the number of attempts to
contact a sample unit is an example of a DCP variable. The
vector z of DCP variables and the set of respondents $s_r$
are random after conditioning on the selected sample since
these quantities would likely take different values if the data
collection process were repeated for a given sample. The
quantity $Z_s$ is an n-row matrix containing $z_k'$ in its $k$th
row, Y is an N-element vector containing $y_k$ in its $k$th
element and X is an N-row matrix containing $x_k'$ in its
$k$th row. The vector x is a vector of additional fixed
auxiliary variables. For instance, these auxiliary variables
could come from an administrative file or, in a longitudinal
survey, they could be the variables of interest observed at
the previous wave. As a result, the vector x may not be
available for nonsample units. Table 1 summarizes the
availability of the different types of variables for the
respondents, nonrespondents and nonsample units.


Table 1
Availability of Variables

                                  y      z       x        d
Respondents: $s_r$                YES    YES     YES      YES
Nonrespondents: $s - s_r$         NO     YES     YES      YES
Nonsample units: $U - s$          NO     NO*     YES**    YES

* The vector z is not even defined for nonsample units.
** The vector x may not always be available for nonsample units.


The joint mechanism $\phi q(Z_s, s_r \mid s, Y, D, X)$ can be
factorized into two distinct random mechanisms: 1)
$\phi(Z_s \mid s, Y, D, X)$ and 2) $q(s_r \mid s, Y, D, X, Z_s)$. The
former is called the DCP mechanism while the latter is
called the nonresponse mechanism. This factorization will
be useful later to obtain properties of our
nonresponse-weight-adjusted estimator defined in equation (2.2) below.
We assume that

$$q(s_r \mid s, Y, D, X, Z_s) = q(s_r \mid s, D_s, X_s, Z_s), \tag{2.1}$$

where $D_s$ and $X_s$ are the sample portions of D and X
respectively. This assumption implies that the nonresponse
mechanism is independent of (or unconfounded with) Y,
after conditioning on s, $D_s$, $X_s$ and $Z_s$, and that the data
are missing at random. However, we make no explicit
simplifying assumption about the DCP mechanism so that it
may well depend on Y, even after conditioning on s, D
and X.

To compensate for unit nonresponse, we consider the
nonresponse-weight-adjusted estimator

$$\hat{t}_y^{NWA} = \sum_{k \in s_r} \frac{w_k}{p_k(\hat{\alpha})}\, y_k, \tag{2.2}$$

where $p_k(\alpha) = P(k \in s_r \mid s, D_s, X_s, Z_s; \alpha)$ is the
conditional response probability for a unit $k \in s$ and $\hat{\alpha}$ is an
estimator of the vector of unknown nonresponse model
parameters $\alpha$. Note that a nonresponse model is a set of
assumptions about the unknown nonresponse mechanism
$q(s_r \mid s, Y, D, X, Z_s)$, one of them being assumption
(2.1). We assume that $\hat{\alpha}$ is implicitly defined by the
equation $U_1(\hat{\alpha}) = 0$, where $U_1(.)$ is a vector of q-unbiased
estimating functions for $\alpha$; that is, $E_q\{U_1(\alpha) \mid s, Y,
D, X, Z_s\} = 0$. Therefore, $U_1(.)$ is also $p\phi q$-unbiased
for $\alpha$. In the remainder of the paper, we remove
everywhere the conditioning on Y, D and X when taking
expectations and variances since these quantities are always
treated as being fixed. For instance, we will write
$E_q\{U_1(\alpha) \mid s, Z_s\} = 0$ instead of $E_q\{U_1(\alpha) \mid s, Y, D,
X, Z_s\} = 0$. This will simplify the notation considerably.

Note that the nonresponse-weight-adjusted estimator
(2.2) is implicitly defined by the equation

$$U_2(\hat{\alpha}, \hat{t}_y^{NWA}) = \hat{t}_y^{NWA} - \sum_{k \in s_r} \frac{w_k}{p_k(\hat{\alpha})}\, y_k = 0. \tag{2.3}$$

If the nonresponse model is correctly specified and, in
particular, if assumption (2.1) is satisfied, then the
estimating function $U_2(.,.)$ is $p\phi q$-unbiased for $t_y$; that is,
$E_{p\phi q}\{U_2(\alpha, t_y)\} = 0$. To make assumption (2.1) as
plausible as possible, it is important that the nonresponse model
be conditional on design, auxiliary and DCP variables that
are well correlated with y, provided that these variables are
also associated with nonresponse. This recommendation
should be useful to control the magnitude of the
nonresponse bias, which may be unavoidable in real surveys.
This is also in line with the recommendation given in Little
and Vartivarian (2005). Therefore, if DCP variables contain
information about y above the information already
contained in d and x, then the use of DCP variables may be
useful to reduce the nonresponse bias if they are associated
with nonresponse.
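
Equation (2.2) amounts to a weighted sum over respondents, which can be sketched as follows. The numbers are made up for illustration and the function name `nwa_total` is hypothetical; the estimated response probabilities are taken as given here.

```python
import numpy as np


def nwa_total(y, w, p_hat, responded):
    """Nonresponse-weight-adjusted total: sum over respondents of w_k * y_k / p_k."""
    y, w, p_hat = (np.asarray(a, dtype=float) for a in (y, w, p_hat))
    responded = np.asarray(responded, dtype=bool)
    return float(np.sum(w[responded] * y[responded] / p_hat[responded]))


# Toy sample of five units; all numbers are made up for illustration.
y = [10.0, 20.0, 30.0, 40.0, 50.0]            # y_k (only respondents' values are used)
w = [2.0, 2.0, 4.0, 4.0, 4.0]                 # design weights w_k = 1 / pi_k
p_hat = [0.5, 0.5, 0.8, 0.8, 0.8]             # estimated response probabilities
responded = [True, False, True, True, False]  # membership in s_r

t_nwa = nwa_total(y, w, p_hat, responded)     # 40 + 150 + 200 = 390
```

Each respondent's design weight is inflated by the inverse of its estimated response probability, so respondents that resemble nonrespondents carry more weight.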

Now, let $\theta = (\alpha', t_y)'$, $\hat{\theta} = (\hat{\alpha}', \hat{t}_y^{NWA})'$ and
$U(\theta) = \{U_1'(\alpha), U_2(\alpha, t_y)\}'$. As
noted above, $\hat{\theta}$ is implicitly defined by the equation
$U(\hat{\theta}) = 0$ and the estimating function U(.) is
$p\phi q$-unbiased for $\theta$ since $E_{p\phi q}\{U(\theta)\} = 0$. Using a first-order
Taylor approximation (see Binder 1983), we have
$\hat{\theta} \approx \theta - \{H(\theta)\}^{-1}U(\theta)$, where
$H(\theta) = E_{p\phi q}\{\partial U(\theta)/\partial\theta'\}$.
The matrix $\{H(\theta)\}^{-1}$ is thus given by

$$\{H(\theta)\}^{-1} = \begin{pmatrix} \{H_{11}(\theta)\}^{-1} & 0 \\ -H_{21}(\theta)\{H_{11}(\theta)\}^{-1} & 1 \end{pmatrix}, \tag{2.4}$$

where $H_{i1}(\theta) = E_{p\phi q}\{\partial U_i(\theta)/\partial\alpha'\}$, for $i = 1, 2$.
Using conditions similar to those of Binder (1983), $\hat{\theta}$ is
asymptotically normal and asymptotically $p\phi q$-unbiased
for $\theta$. As a result, $\hat{t}_y^{NWA}$ is asymptotically normal and
asymptotically $p\phi q$-unbiased for $t_y$. Therefore, using
DCP variables in the nonresponse model does not introduce
any bias in the nonresponse-weight-adjusted estimator
$\hat{t}_y^{NWA}$ provided that the nonresponse model (specification of
$q(s_r \mid s, D_s, X_s, Z_s)$ and assumption 2.1) holds. Also, if
the true unknown nonresponse mechanism depends on the
sample portion of Y, $Y_s$, after conditioning on s, $D_s$ and
$X_s$, then conditioning on a vector z of DCP variables is
likely to reduce the nonresponse bias if the DCP mechanism
depends on $Y_s$, after conditioning on s, $D_s$ and $X_s$,
which means that the DCP variables contain information
about y not already contained in d and x.

Continuing our Taylor linearization, and using the fact
that

$$V_{p\phi q}\{U(\theta)\} = V_p E_{\phi q}\{U(\theta) \mid s\} + E_p V_\phi E_q\{U(\theta) \mid s, Z_s\} + E_{p\phi} V_q\{U(\theta) \mid s, Z_s\},$$

the $p\phi q$-variance-covariance matrix of $\hat{\theta}$, $V_{p\phi q}(\hat{\theta})$, is
approximated by

$$V_{p\phi q}(\hat{\theta}) \approx \{H(\theta)\}^{-1} V_p E_{\phi q}\{U(\theta) \mid s\}\{H'(\theta)\}^{-1}
+ \{H(\theta)\}^{-1} E_p V_\phi E_q\{U(\theta) \mid s, Z_s\}\{H'(\theta)\}^{-1}
+ \{H(\theta)\}^{-1} E_{p\phi} V_q\{U(\theta) \mid s, Z_s\}\{H'(\theta)\}^{-1}. \tag{2.5}$$

The first term on the right-hand side of equation (2.5) is
called the sampling variance of $\hat{\theta}$, the second term is called
the DCP variance of $\hat{\theta}$ and the third term is called the
nonresponse variance of $\hat{\theta}$. The variance $V_{p\phi q}(\hat{t}_y^{NWA})$ is
approximated by the value in the last row and in the last
column of equation (2.5). Using expression (2.4) and the
fact that $E_q\{U(\theta) \mid s, Z_s\} = (0', t_y - \hat{t}_y)'$, the
approximate variance (2.5) reduces to

$$V_{p\phi q}(\hat{\theta}) \approx \begin{pmatrix} 0 & 0 \\ 0' & V_p(\hat{t}_y) \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0' & 0 \end{pmatrix}
+ \{H(\theta)\}^{-1} E_{p\phi} V_q\{U(\theta) \mid s, Z_s\}\{H'(\theta)\}^{-1}. \tag{2.6}$$


The second matrix on the right-hand side of equation
(2.6) corresponds to the DCP variance of $\hat{\theta}$ and contains 0
for all its elements. Therefore, using random auxiliary
(DCP) variables in the nonresponse model does not 
introduce any additional term of variance, as opposed to 
using only fixed auxiliary variables, when the nonresponse 
model is properly specified. Since DCP variables are likely 
to reduce the nonresponse bias if they are associated with y, 
then it seems beneficial to take advantage of them when 
handling unit nonresponse through a weight adjustment. 
Also, as pointed out by Little and Vartivarian (2005), adding 
auxiliary variables in the nonresponse model that are 
associated with y tends to reduce the nonresponse variance. 
The mean squared error can therefore be reduced on both 
counts. 
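
The step from (2.5) to (2.6) rests on a simple matrix fact: with $\{H(\theta)\}^{-1}$ of the block form (2.4), a middle matrix whose only nonzero entry is its last diagonal element passes through the sandwich unchanged. The sketch below checks this numerically; the blocks standing in for $H_{11}$ and $H_{21}$ are random placeholders, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1983)
q = 3                                               # hypothetical number of model parameters

h11 = rng.normal(size=(q, q)) + 5.0 * np.eye(q)     # stand-in for H_11, kept well conditioned
h21 = rng.normal(size=(1, q))                       # stand-in for H_21

# H has the block form [[H_11, 0], [H_21, 1]], so its inverse has the form in (2.4).
h = np.block([[h11, np.zeros((q, 1))], [h21, np.ones((1, 1))]])
h_inv = np.linalg.inv(h)

# Middle matrix with a single nonzero entry, as for the sampling-variance term.
v = 7.0
middle = np.zeros((q + 1, q + 1))
middle[-1, -1] = v

sandwich = h_inv @ middle @ h_inv.T                 # {H}^{-1} M {H'}^{-1}
```

Because the last column of the inverse is $(0', 1)'$, the product equals the middle matrix itself, which is why the sampling variance term in (2.6) is simply $V_p(\hat{t}_y)$ in the last diagonal position and why a zero middle matrix yields a zero DCP variance.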

A more detailed expression for the nonresponse variance 
term in equation (2.6) as well as a sampling and a non- 
response variance estimator can be obtained similarly as in 
Beaumont (2005). Beaumont (2005) also discusses the 
effect of estimating the nonresponse model parameters on 
the variance of an estimator of a population total. 


3. The Example of the Canadian Labour Force Survey


The goal of this example is not to provide every detail of 
the analysis that was conducted on the Canadian Labour 
Force Survey (CLFS) data but simply to describe some 
issues related to the choice of the nonresponse model and to 
the estimation of response probabilities. With these points in 
mind, we then go on to discuss the main conclusions that 
were reached. Greater detail about the results of the 
investigations in the CLFS, implementation of the new 
method and a comparison with the previous method can be 
found in Alavi and Beaumont (2004). 

The CLFS is a monthly survey with a stratified
multistage sampling design (Gambino, Singh, Dufour, Kennedy
and Lindeyer 1998). The information used to construct the
sampling design and to draw a sample of dwellings is
essentially geographic. The sample is divided into six
representative rotation groups and each sampled dwelling stays in
the sample for six consecutive months. One rotation group
contains dwellings for which the members are interviewed
for the first time; another rotation group contains dwellings
for which the members are interviewed for the second time
and so on. Thus, for five rotation groups out of six, the
sampled dwellings are common from one month to the next.
Computer-assisted interviews are used to collect the survey
information for every person in the selected households.
With computer-assisted interviews, a large amount of DCP
information is obtained for both responding and
nonresponding households.

A logistic nonresponse model has been considered to
model the unknown nonresponse mechanism $q(s_r \mid s, D_s,
Z_s)$. With this model, the unknown response probability
for household k is expressed as
$p_k(\alpha) = \{1 + \exp(-\alpha'(zd)_k)\}^{-1}$ and sampled households are assumed to respond
independently of one another. The vector zd is a vector
that contains DCP variables z, fixed design variables d as
well as interactions between these two types of variables.
No additional vector x of auxiliary variables was available.
Two DCP variables were used: the number of attempts to
contact a sampled household, which was divided into five
categories, and the time of the last attempt, which was also
divided into five categories. The design variables used were
mainly geographic and also included the rotation group
indicator. Due to potential interviewer and clustering effects,
the above model may not be entirely realistic. It was used
for its simplicity and because it appeared reasonable and an
improvement over the previous method. Also, the estimated
response probabilities resulting from this model were used
only to provide a score and were not used directly to adjust
design weights, as described below in this section.

The unknown vector $\alpha$ was estimated by the maximum
likelihood method using the q-unbiased estimating
function

$$U_1(\alpha) = \sum_{k \in s}\{r_k - p_k(\alpha)\}(zd)_k, \tag{3.1}$$

where $r_k = 1$, if $k \in s_r$, and $r_k = 0$, otherwise. Note that a
design-weighted estimating function was not considered.
This follows the practice recommended in Little and
Vartivarian (2003) and can be justified by noting that the
interest is in modelling the nonresponse mechanism only for
sampled households $k \in s$ (not for the whole population)
and that this mechanism is conditional on s. Also, the DCP
variables are not even defined outside the sample. The use
of design weights thus does not make sense in this context
and increases the variance of $\hat{\alpha}$ if the nonresponse model is
correctly specified. Also, it is not clear that using a
design-weighted estimating function would systematically bring
robustness in this case. However, note that we do not ignore
design information since it is included in the nonresponse
model. This can be paralleled to the recommendation of
including design information in imputation models (see, for
example, Rubin 1996).
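
Solving the unweighted estimating equation (3.1) can be sketched with a few Newton-Raphson steps. The data below are simulated and the single "many attempts" indicator is a made-up stand-in for the categorical DCP and design variables used in the CLFS; the function name `fit_logistic` is hypothetical.

```python
import numpy as np


def fit_logistic(zd, r, n_iter=25):
    """Solve U_1(alpha) = sum_k {r_k - p_k(alpha)} zd_k = 0 by Newton-Raphson."""
    zd = np.asarray(zd, dtype=float)
    r = np.asarray(r, dtype=float)
    alpha = np.zeros(zd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-zd @ alpha))
        score = zd.T @ (r - p)                           # U_1(alpha), unweighted
        info = zd.T @ (zd * (p * (1.0 - p))[:, None])    # minus the derivative of U_1
        alpha = alpha + np.linalg.solve(info, score)
    return alpha


# Simulated sample: intercept plus a made-up "many attempts" indicator.
rng = np.random.default_rng(3)
n = 2000
many_attempts = rng.integers(0, 2, size=n)
zd = np.column_stack([np.ones(n), many_attempts])
true_alpha = np.array([1.5, -1.0])
r = rng.random(n) < 1.0 / (1.0 + np.exp(-zd @ true_alpha))

alpha_hat = fit_logistic(zd, r)
p_hat = 1.0 / (1.0 + np.exp(-zd @ alpha_hat))
score_at_solution = zd.T @ (r - p_hat)                   # should be numerically zero
```

No design weights enter the score, in line with the argument above that the mechanism is modelled conditionally on the selected sample.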
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Stepwise logistic regression was performed for several 
months in order to determine appropriate design and DCP 
variables to be included in the final nonresponse model. In 
all months considered, the variable ‘number of attempts’ 
was the first to enter in the model and thus the most useful 
for explaining nonresponse. This variable was also corre- 
lated with the main variables of interest ‘employment’ and 
‘unemployment’. For instance, people belonging to respon- 
dent households with a large number of attempts, i.e. those 
that are difficult to reach, had a tendency to be more often 
employed (see Alavi and Beaumont 2004). Households with 
a large number of attempts also had a tendency to be 
nonrespondents. Therefore, it seems appropriate to give a 
larger weight adjustment to the responding households for 
which the number of attempts is large since their propensity 
to respond is lower and they are more likely to have 
characteristics similar to the nonrespondents. 

The final nonresponse model chosen fit the CLFS data 
reasonably well in most months considered, according to the 
Hosmer-Lemeshow goodness-of-fit test. Nevertheless, the 
score method of Little (1986) was used to obtain some 
robustness against undetected model failures. The above 
logistic nonresponse model was first used to obtain an 
estimated response probability for every sampled household 
and then the sample was divided into about 50 homog- 
eneous classes with respect to this estimated response 
probability using the clustering algorithm implemented in 
the procedure FASTCLUS of SAS. This large number of 
classes was possible given the large CLFS sample size. It 
was chosen so as to reduce the nonresponse bias not only at 
the population level but also for smaller domains. The 
nonresponse weight adjustment for a responding household 
k within a given class c was simply computed as the inverse 
of the unweighted response rate within class c. A threshold 
on the nonresponse weight adjustment was set to 2.5 to 
control the nonresponse variance of the 
nonresponse-weight-adjusted estimator. The threshold had to 
be applied to only a very small number of classes, namely 
those with the smallest estimated response probabilities. 
Without this threshold, nonresponse weight adjustments 
around 4 could occasionally be observed. 
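
The class-based adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `class_adjustments` is hypothetical, and the classes are formed as equal-size groups of the sorted scores, which is a simplification standing in for the FASTCLUS clustering used in the paper. The cap of 2.5 is the threshold mentioned in the text.

```python
def class_adjustments(p_hat, responded, n_classes=50, cap=2.5):
    """Return one nonresponse weight adjustment per sampled household.

    p_hat     -- estimated response probabilities (used only as a score)
    responded -- 1 for responding households, 0 otherwise
    Classes are equal-size groups of the sorted scores (an assumption;
    the paper clusters the scores with SAS PROC FASTCLUS instead).
    """
    n = len(p_hat)
    order = sorted(range(n), key=lambda k: p_hat[k])
    adjustments = [0.0] * n
    for c in range(n_classes):
        members = order[c * n // n_classes:(c + 1) * n // n_classes]
        if not members:
            continue
        n_resp = sum(responded[k] for k in members)
        # Inverse of the unweighted response rate, truncated at the cap.
        adj = cap if n_resp == 0 else min(len(members) / n_resp, cap)
        for k in members:
            adjustments[k] = adj if responded[k] else 0.0
    return adjustments


scores = [0.2, 0.3, 0.8, 0.9]
resp = [1, 0, 1, 1]
adj = class_adjustments(scores, resp, n_classes=2)
```

Nonrespondents receive an adjustment of zero here because only responding households carry a final weight.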

Another nonresponse model was considered in which the 
response probability for a household k is modelled as the 
product of the probability that household k be contacted, 
times the probability that this household respond, given it is 
contacted. The latter two probabilities were modelled sepa- 
rately. Although this model seems to be a better approx- 
imation of reality and gave slightly better results in the sense 
that it better explained nonresponse, the gains were not 
deemed sufficient to add this complexity in the nonresponse 
adjustment method. It may deserve further study. 
4. Conclusion 


An important point made in this paper is that DCP 
information must be treated as being random when used in a 
nonresponse model. We have then shown that the use of 
such information to handle unit nonresponse through a 
weight adjustment does not introduce any bias and that there 
is no additional variance component in the estimates of 
population totals when the nonresponse model is properly 
specified. Moreover, we have argued that if DCP 
information is associated with the variables of interest and 
with nonresponse, then its use is likely to reduce the 
nonresponse bias when the nonresponse mechanism 
depends directly on the variables of interest. We have also 
illustrated through the CLFS example that such information 
can be useful for dealing with unit nonresponse in a major 
survey. 

The full response estimator that we have considered is 
the Horvitz-Thompson estimator. Our conclusions would 
have remained the same had we used instead a generalized 
regression estimator. We have used the Horvitz-Thompson 
estimator for its simplicity and because it was sufficient to 
show our main point. 
On the Correlation Structure of Sample Units 


Alfredo Bustos 


Abstract 


In this paper we make explicit some distributional properties of sample units, not usually found in the literature; in 
particular, their correlation structure and the fact that it does not depend on arbitrarily assigned population indices. Such 
properties are relevant to a number of estimation procedures, whose efficiency would benefit from making explicit reference 
to them. 


Key Words: Census; Survey; Sampling; Sample units; Probability function; Mean; Covariance. 


1. Introduction 


In recent times, population and household censuses, as 
we know them, have become more difficult to perform for a 
number of reasons. Alternative ways of securing more 
frequent information for the production of local, state and 
national statistical results have been proposed. Continuous 
large national surveys, among them those known as rolling 
censuses, with large sample sizes and complex designs, are 
being considered. 

However, in order to produce results at the local 
authority level the way a census does, different techniques 
for estimation as well as for validation and, in some cases, 
for imputation have to be developed and their efficiency 
improved. One way of achieving greater efficiency consists 
of taking into account all relevant information available. Of 
course, this includes the stochastic properties of sample 
units. 

In what follows, beginning from basic principles, we 
derive a general explicit form for the probability function of 
an ordered sample. We also show how that function, as well 
as the inclusion probabilities, can be computed. Finally, we 
give a general form for the correlation matrix of sample 
units, which depends solely on inclusion probabilities, so 
that linear and maximum-likelihood estimation procedures 
can benefit from it. 


2. The Basic Model 


The basic model we start from represents the sequential 
random drawing of n units from a population U formed by 
N such units, and may be stated as follows. Let N and n be 
two positive constants such that n < N, and let V represent 
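an $N \times n$ matrix, whose components are each distributed as 
Bernoulli random variables with, possibly, different 
parameters. Then, 

$$V_{N \times n} = \begin{pmatrix} \delta_{11} & \delta_{12} & \delta_{13} & \cdots & \delta_{1n} \\ \delta_{21} & \delta_{22} & \delta_{23} & \cdots & \delta_{2n} \\ \delta_{31} & \delta_{32} & \delta_{33} & \cdots & \delta_{3n} \\ \vdots & \vdots & \vdots & & \vdots \\ \delta_{N1} & \delta_{N2} & \delta_{N3} & \cdots & \delta_{Nn} \end{pmatrix} \qquad (1.1)$$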


Also part of the model is the restriction imposed on each 
column of V to add to one. In other words, we require that 


$$\sum_{I=1}^{N} \delta_{Ij} = 1, \quad \text{for } j = 1, \ldots, n \qquad (1.2)$$


be satisfied. 

This is required because if the $j$th draw results in 
population unit $I$ being selected, then entry $(I, j)$ takes the 
value of one while all other entries of column $j$ are equal to 
zero. Note that this is equivalent to imposing a 
non-stochastic constraint on the behavior of all components of 
the $j$th column of V, regardless of the sampling scheme. 
Therefore, entries belonging to the same column do not 
behave independently. 

When sampling takes place with replacement (WR), the 
sum of the elements of the $I$th row of the above matrix is 
distributed as a Binomial$(n, p_I)$ since each column is 
distributed independently of other columns. On the other 
hand, when sampling takes place without replacement 
(WOR), the total of row $I$ can take only two values: one, if 
the $I$th unit is drawn at some stage, or zero, otherwise, 
bringing us back to the Bernoulli case. 
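
A small simulation may help make the structure of the matrix concrete. The sketch below is an illustration only: the function name `draw_matrix` is hypothetical, and for WOR it assumes a successive scheme in which the remaining units are redrawn with probabilities proportional to their first-draw probabilities, which the paper does not prescribe.

```python
import random


def draw_matrix(p, n, with_replacement, rng):
    """Simulate the N x n indicator matrix of n sequential draws.

    p -- first-draw selection probabilities of the N units (sum to one)
    Entry [I][j] is 1 when draw j selects unit I, and 0 otherwise.
    """
    N = len(p)
    V = [[0] * n for _ in range(N)]
    available = list(range(N))
    for j in range(n):
        weights = [p[I] for I in available]
        I = rng.choices(available, weights=weights, k=1)[0]
        V[I][j] = 1
        if not with_replacement:
            available.remove(I)  # a drawn unit cannot be selected again
    return V


rng = random.Random(2005)
p = [0.1, 0.2, 0.3, 0.4]
V_wor = draw_matrix(p, 3, False, rng)
V_wr = draw_matrix(p, 3, True, rng)
col_sums_wor = [sum(V_wor[I][j] for I in range(4)) for j in range(3)]
row_sums_wor = [sum(row) for row in V_wor]
col_sums_wr = [sum(V_wr[I][j] for I in range(4)) for j in range(3)]
```

Whatever the random outcome, every column adds to one, and under WOR every row total is zero or one.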

Disjoint subsets of rows may be formed according to 
different criteria. For instance, when rows are grouped with 
regard to their spatial vicinity, one could speak about 
clusters or primary sampling units. When one or more 
statistical indicators form the basis for the groupings, the 
term strata is usually used. 
Let us now define the inclusion probabilities as 

$$\pi_I^{(k)} = P(\text{population unit } I \text{ in sample of size } k), \qquad \pi_I^{(k)} = 0 \text{ at } k = 0. \qquad (2)$$

Note that $\pi_I^{(n)} = \pi_I$, commonly referred to as the 
inclusion probability for unit $I$. 

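Now let $\delta_{\cdot j}$ represent the $j$th column and $\delta_{I \cdot}$ the $I$th 
row of matrix V. Therefore, based on the following 
expression, 

$$f(\delta_{\cdot 1}, \delta_{\cdot 2}, \ldots, \delta_{\cdot n}) = f(\delta_{\cdot 1})\, f(\delta_{\cdot 2} \mid \delta_{\cdot 1})\, f(\delta_{\cdot 3} \mid \delta_{\cdot 1}, \delta_{\cdot 2}) \cdots f(\delta_{\cdot n} \mid \delta_{\cdot 1}, \ldots, \delta_{\cdot n-1}) \qquad (3)$$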


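we can write the joint probability function of the elements of 
V as: 

$$f(\delta_{\cdot 1}, \delta_{\cdot 2}, \ldots, \delta_{\cdot n}) = \prod_{k=1}^{n}\left[\prod_{I=1}^{N}\left(\pi_I^{(k)} - \pi_I^{(k-1)}\right)^{\delta_{Ik}}\right] = \prod_{k=1}^{n}\left[\prod_{I=1}^{N}\left(p_I^{(k)}\right)^{\delta_{Ik}}\right] \qquad (4)$$

subject to 

$$\sum_{I=1}^{N} \delta_{Ik} = 1, \quad k = 1, \ldots, n, \qquad \text{and} \qquad \sum_{k=1}^{n} \delta_{Ik} \le \begin{cases} 1, & \text{WOR} \\ n, & \text{WR} \end{cases}$$

and here $p_I^{(k)}$, defined as $p_I^{(k)} = (\pi_I^{(k)} - \pi_I^{(k-1)})$, stands for 
the probability that population unit $I$ is included in the 
sample at the $k$th draw. The above function is useful for 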
calculating the probability of any ordered sample of size n. 
Clearly, when the order of inclusion can be ignored, the 
probability of a given sample would be obtained by adding 
the n! values obtained through (4). 


3. The Implications of Sampling on the Stochastic 
Properties of Population Units 


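Consequently, 

$$E(\delta_{Ik}) = p_I^{(k)} = (\pi_I^{(k)} - \pi_I^{(k-1)}) \qquad (5)$$

and therefore, we can write 

$$E[V] = \begin{pmatrix} p_1^{(1)} & p_1^{(2)} & \cdots & p_1^{(n)} \\ p_2^{(1)} & p_2^{(2)} & \cdots & p_2^{(n)} \\ \vdots & \vdots & & \vdots \\ p_N^{(1)} & p_N^{(2)} & \cdots & p_N^{(n)} \end{pmatrix}. \qquad (6)$$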
From here, step-by-step inclusion probabilities, in WOR 
sampling situations, may be recursively computed, as is 
shown in (7), below. 


P if k=1 


ies este) (7) 
pi? YL if k>1. 
Pip leas Ti 

Note that (7) enables us to compute the desired probabi- 
lities at two different moments: first, when no draw has 
actually occurred, which explains why we average over the 
whole population, and secondly, when the result of the 
previous draw is known, at which time the probability of the 
$J$th population unit, say, entering the sample equals one and 
all other probabilities for that draw are equal to zero. Hence, 
at least in theory, we can compute the inverse of the so 
called expansion factors or weights for one stage sampling, 
or stage by stage in multistage sampling. Clearly, 


$$\pi_I = \sum_{k=1}^{n} p_I^{(k)}. \qquad (8)$$


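If we define the joint inclusion probabilities as 

$$\pi_{IJ}^{(k)} = P(\text{population units } I \text{ and } J \text{ in sample of size } k) \qquad (9)$$

then we have that they can also be computed as follows: 

$$\pi_{IJ} = \sum_{j=1}^{n-1}\left[p_I^{(j)} \sum_{k>j}^{n} p_{J \mid I}^{(k)} + p_J^{(j)} \sum_{k>j}^{n} p_{I \mid J}^{(k)}\right]. \qquad (10)$$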


For example, in simple random sampling WR 
(SRS/WR), expressions (7), (8) and (10) result in (7.1), (8.1) 
and (10.1), 


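$$p_I^{(k)} = \frac{1}{N} \quad \text{when } k \ge 1 \qquad (7.1)$$

$$\pi_I = \frac{n}{N} \qquad (8.1)$$

$$\pi_{IJ} = \sum_{j=1}^{n-1}\left[p_I^{(j)} \sum_{k>j}^{n} p_{J \mid I}^{(k)} + p_J^{(j)} \sum_{k>j}^{n} p_{I \mid J}^{(k)}\right] = \sum_{j=1}^{n-1} \frac{2(n-j)}{N^2} = \frac{n(n-1)}{N^2}. \qquad (10.1)$$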


While in SRS/WOR we get expressions (7.2), (8.2) and 
(10.2), instead. 


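$$p_I^{(k)} = \frac{1}{N} \quad \text{when } k \ge 1 \qquad (7.2)$$

$$\pi_I = \frac{n}{N} \qquad (8.2)$$

$$\pi_{IJ} = \sum_{j=1}^{n-1}\left[p_I^{(j)} \sum_{k>j}^{n} p_{J \mid I}^{(k)} + p_J^{(j)} \sum_{k>j}^{n} p_{I \mid J}^{(k)}\right] \quad \text{where } J \ne I$$

$$= \sum_{j=1}^{n-1}\left[\frac{n-j}{N(N-1)} + \frac{n-j}{N(N-1)}\right] = \frac{n(n-1)}{N(N-1)}. \qquad (10.2)$$
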
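Let us now consider the row vectors $\delta_{I \cdot}$. Then, for the 
covariance matrix between different rows, we get 

$$\mathrm{Cov}(\delta_{I \cdot}, \delta_{J \cdot}) = \begin{pmatrix} -p_I^{(1)} p_J^{(1)} & 0 & \cdots & 0 \\ 0 & -p_I^{(2)} p_J^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -p_I^{(n)} p_J^{(n)} \end{pmatrix}_{n \times n} \qquad (11)$$

whenever $I$ is different from $J$. 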

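When sampling takes place WR, and therefore, $p_I^{(j)} = p_I \ \forall\, j = 1, \ldots, n$, 
the covariance matrix for the $I$th row 
vector is given by 

$$\mathrm{Cov}(\delta_{I \cdot}, \delta_{I \cdot}) = \begin{pmatrix} p_I q_I & 0 & \cdots & 0 \\ 0 & p_I q_I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_I q_I \end{pmatrix}_{n \times n} \qquad (12.1)$$
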
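In a WOR setting the above covariance matrix becomes 

$$\mathrm{Cov}(\delta_{I \cdot}, \delta_{I \cdot}) = \begin{pmatrix} p_I^{(1)}(1 - p_I^{(1)}) & -p_I^{(1)} p_I^{(2)} & \cdots & -p_I^{(1)} p_I^{(n)} \\ -p_I^{(2)} p_I^{(1)} & p_I^{(2)}(1 - p_I^{(2)}) & \cdots & -p_I^{(2)} p_I^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ -p_I^{(n)} p_I^{(1)} & -p_I^{(n)} p_I^{(2)} & \cdots & p_I^{(n)}(1 - p_I^{(n)}) \end{pmatrix}_{n \times n} \qquad (12.2)$$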


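Let $\delta$ represent the $N$-dimensional vector which results 
from adding the columns of V. Clearly, the components of 
this vector may be expressed as the product of $\delta_{I \cdot}$ by a 
vector whose components are all equal to one. In other 
words, 

$$\delta = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_N \end{pmatrix} = \begin{pmatrix} \delta_{1 \cdot} 1 \\ \delta_{2 \cdot} 1 \\ \vdots \\ \delta_{N \cdot} 1 \end{pmatrix}. \qquad (13)$$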
Some distributional properties of these sums may be then 
obtained directly from those of the rows or the columns of 
matrix V. 

For instance, their expected values are given as 


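$$E(\delta_I) = E(\delta_{I \cdot} 1) = E\left[\sum_{k=1}^{n} \delta_{Ik}\right] = \sum_{k=1}^{n} p_I^{(k)} = \pi_I^{(1)} + \sum_{k=2}^{n}\left(\pi_I^{(k)} - \pi_I^{(k-1)}\right) = \pi_I^{(n)}. \qquad (14)$$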


From (1.2), we get the non-stochastic restriction: 
$$1'\delta = (\delta_1 + \delta_2 + \delta_3 + \cdots + \delta_N) = n. \qquad (15)$$


From (14) and (15), well known propositions (16) and 
(17) follow immediately, 


ROG) iret ne? y aga) (16) 


Te Be Tg tp ea. On 


For the second order moments, we get 


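$$\mathrm{Cov}(\delta_I, \delta_J) = \mathrm{Cov}(1'\delta_{I \cdot}', 1'\delta_{J \cdot}') = 1'\,\mathrm{Cov}(\delta_{I \cdot}, \delta_{J \cdot})\,1 = -\sum_{k=1}^{n} p_I^{(k)} p_J^{(k)} = \begin{cases} -n p_I p_J, & \text{WR} \\ \pi_{IJ}^{(n)} - \pi_I^{(n)} \pi_J^{(n)}, & \text{WOR,} \end{cases} \qquad (18)$$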


which clearly indicates that the covariance is never positive. 
In turn, the variances are given by 


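$$\mathrm{Var}(\delta_I) = \mathrm{Var}(1'\delta_{I \cdot}') = 1'\,\mathrm{Cov}(\delta_{I \cdot})\,1 = \begin{cases} n p_I q_I, & \text{WR} \\ \pi_I^{(n)}(1 - \pi_I^{(n)}), & \text{WOR.} \end{cases} \qquad (19)$$
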
Another important consequence of (15) has to do with 
the second order moments of the stochastic vector $\delta$. 

$$0 = \mathrm{Var}(n) = \mathrm{Var}(1'\delta) = 1'\,\mathrm{Cov}(\delta)\,1 = 1'C1. \qquad (20)$$

Clearly, the diagonal elements of matrix $C$, the 
covariance matrix of $\delta$, are not all equal to zero. Therefore, 
randomly drawing a fixed-size sample introduces a 
dependency in the population units which results in non-null 
covariances implying that matrix $C$ is singular. Otherwise, it 
is impossible for (20) to be satisfied. 

As a matter of fact, it is possible to prove that the sum of 
any row (or column) of C must be equal to zero, which is a 
stronger statement. Given that the covariance between a 
random variable and a constant equals zero, we get 
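$$0 = \mathrm{Cov}(\delta_I, n) = \mathrm{Cov}(\delta_I, \delta_1 + \delta_2 + \cdots + \delta_N) = C_{I1} + C_{I2} + \cdots + C_{IN} = \mathrm{Var}(\delta_I) + \sum_{J \ne I} \mathrm{Cov}(\delta_I, \delta_J). \qquad (21)$$

We have thus proven that in WOR sampling (22.1) holds. 

$$0 = \pi_I^{(n)}(1 - \pi_I^{(n)}) + \sum_{J \ne I}\left(\pi_{IJ}^{(n)} - \pi_I^{(n)} \pi_J^{(n)}\right). \qquad (22.1)$$
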
The same statement can be proven algebraically by noting 
that 


$$\sum_{J \ne I} \pi_{IJ}^{(n)} = \pi_I^{(n)} \sum_{J \ne I} \pi_{J \mid I}^{(n-1)} = (n-1)\,\pi_I^{(n)},$$

which is obvious once we realize that the conditional 
probability involved represents the probability that popu- 
lation unit J enters a sample of size n—1 for which (19) 
also applies. Additionally, using (19) again, note that 

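$$\sum_{J \ne I} \pi_J^{(n)} = n - \pi_I^{(n)},$$

and therefore, 

$$0 = \pi_I^{(n)}(1 - \pi_I^{(n)}) + \sum_{J \ne I}\left(\pi_{IJ}^{(n)} - \pi_I^{(n)} \pi_J^{(n)}\right) = \pi_I^{(n)} - (\pi_I^{(n)})^2 + (n-1)\,\pi_I^{(n)} - \pi_I^{(n)}(n - \pi_I^{(n)}).$$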


For WR sampling (21) implies: 
$$0 = n p_I q_I + \sum_{J \ne I}\left(n(n-1) p_I p_J - n^2 p_I p_J\right) = n p_I q_I - n p_I \sum_{J \ne I} p_J \qquad (22.2)$$


which is immediately seen to apply. 

In any case, the most important implication of the above 
results is that regardless of the sampling scheme, the 
correlation matrix of the population random variables 
$\delta_1, \delta_2, \delta_3, \ldots, \delta_N$ is singular. For the practical situations 
described in the introduction, the most important impli- 
cation of this fact lies mainly in the use made by many 
model fitting and estimation procedures of the inverse of the 
covariance matrix. 
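
The zero row sums can be checked numerically. The following is a minimal sketch, with hypothetical function names, that builds the covariance matrix of the unit totals from the SRS/WOR values in (8.2) and (10.2) and from the WR expressions in (18) and (19), using exact rational arithmetic.

```python
from fractions import Fraction


def srs_wor_covariance(N, n):
    """Covariance matrix C of the unit totals under SRS without replacement."""
    pi = Fraction(n, N)                       # inclusion probability (8.2)
    pi2 = Fraction(n * (n - 1), N * (N - 1))  # joint inclusion probability (10.2)
    var = pi * (1 - pi)
    cov = pi2 - pi * pi
    return [[var if I == J else cov for J in range(N)] for I in range(N)]


def wr_covariance(p, n):
    """Covariance matrix C of the unit totals under WR sampling."""
    N = len(p)
    return [[n * p[I] * (1 - p[I]) if I == J else -n * p[I] * p[J]
             for J in range(N)] for I in range(N)]


C_wor = srs_wor_covariance(N=6, n=3)
C_wr = wr_covariance([Fraction(k, 10) for k in (1, 2, 3, 4)], n=3)
row_sums_wor = [sum(row) for row in C_wor]
row_sums_wr = [sum(row) for row in C_wr]
```

Every row adds to exactly zero in both cases, so the vector of ones lies in the null space of C and the matrix cannot be inverted.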


4. The First Two Moments of Sample Units 


Once the first and second order moments of the vector 0 
have been established, we are in a position to determine the 
corresponding moments for sub-vectors of different sizes 
and whose components are randomly chosen, i.e., the 
sample. To this end, let us define the random variables 
$\delta_{I_1}, \delta_{I_2}, \delta_{I_3}, \ldots, \delta_{I_r}$, where $r$ represents the number of 
different population units in the sample, and whose indices 
$I_k$, $1 \le k \le r \le n$, can take the value $I$ with probability 
$\pi_I^{(n)}$. In other words, under the above conditions, we are in 
the presence of a set of random variables whose indices are 
random themselves. 
4.1 Mean and Variance for WR Sampling 

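For this case, the probability function of $\delta_{I_k}$ is given by 

$$P(\delta_{I_k} = x) = \sum_{I=1}^{N} p_I\, P(\delta_I = x) = \sum_{I=1}^{N} p_I \binom{n}{x} p_I^{x} (1 - p_I)^{n-x}. \qquad (23)$$

The first two moments may also be obtained via a 
conditional argument. The mean of its distribution is given by 

$$E(\delta_{I_k}) = \sum_{I=1}^{N} p_I\, E(\delta_I) = \sum_{I=1}^{N} p_I\, n p_I = n \sum_{I=1}^{N} p_I^2. \qquad (24)$$

In turn, its variance is computed using the well known 
formula 

$$V(\delta_{I_k}) = V_{I_k}[E(\delta_{I_k} \mid I_k)] + E_{I_k}[V(\delta_{I_k} \mid I_k)]. \qquad (25)$$

In this case, we have 

$$E(\delta_{I_k} \mid I_k = I) = n p_I \quad \text{and} \quad V(\delta_{I_k} \mid I_k = I) = n p_I (1 - p_I). \qquad (26)$$

Hence, 

$$V_{I_k}[E(\delta_{I_k} \mid I_k)] = V_{I_k}(n p_{I_k}) = n^2\left[E_{I_k}(p_{I_k}^2) - E_{I_k}^2(p_{I_k})\right],$$

$$E_{I_k}[V(\delta_{I_k} \mid I_k)] = n E_{I_k}(p_{I_k} - p_{I_k}^2) = n\left[E_{I_k}(p_{I_k}) - E_{I_k}(p_{I_k}^2)\right] \qquad (27)$$

and therefore 

$$V(\delta_{I_k}) = n\left[E_{I_k}(p_{I_k}) + (n-1) E_{I_k}(p_{I_k}^2) - n E_{I_k}^2(p_{I_k})\right] = n \sum_{I=1}^{N} p_I^2 \left[1 + (n-1) p_I - n \sum_{J=1}^{N} p_J^2\right]. \qquad (28)$$

For the case of SRS, (24) above results in 

$$E(\delta_{I_k}) = n \sum_{I=1}^{N} \left(\frac{1}{N}\right)^2 = \frac{n}{N},$$

while (28) yields 

$$V(\delta_{I_k}) = n \sum_{I=1}^{N} \frac{1}{N^2}\left(1 + (n-1)\frac{1}{N} - n \sum_{J=1}^{N} \frac{1}{N^2}\right) = n\,\frac{1}{N}\left(1 - \frac{1}{N}\right).$$
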
4.2 Mean and Variance for WOR Sampling 
For this case, the probability function of $\delta_{I_k}$ is given by 


1 ~ n . x —* 
PO, =x) =— Dim (Pry d-pry* 29) 


I=] k=l 
and therefore 


E(8, )=— TDR EO)) 
n ‘ ] N 
=2Y APY) =+DaPy. G0) 
Nn ja j=l UF pet 


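Using (25) again, we note firstly that 

$$E(\delta_{I_k} \mid I_k) = \pi_{I_k}^{(n)} \quad \text{and} \quad V(\delta_{I_k} \mid I_k) = \pi_{I_k}^{(n)}(1 - \pi_{I_k}^{(n)})$$

from which we get 

$$V[E(\delta_{I_k} \mid I_k)] = V(\pi_{I_k}^{(n)}) = E[(\pi_{I_k}^{(n)})^2] - [E(\pi_{I_k}^{(n)})]^2$$

and 

$$E[V(\delta_{I_k} \mid I_k)] = E[\pi_{I_k}^{(n)}(1 - \pi_{I_k}^{(n)})] = E(\pi_{I_k}^{(n)}) - E[(\pi_{I_k}^{(n)})^2].$$

Hence, the variance is given by 

$$V(\delta_{I_k}) = E(\pi_{I_k}^{(n)}) - E^2(\pi_{I_k}^{(n)}) = E(\pi_{I_k}^{(n)})\left[1 - E(\pi_{I_k}^{(n)})\right] = \left(\frac{1}{N}\sum_{I=1}^{N} \pi_I^{(n)}\right)\left[1 - \frac{1}{N}\sum_{I=1}^{N} \pi_I^{(n)}\right]. \qquad (31)$$
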

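Once again, in order to exemplify these results, let us turn 
to SRS. Expression (30) becomes 

$$E(\delta_{I_k}) = \frac{1}{N}\sum_{I=1}^{N} \pi_I^{(n)} = \frac{1}{N}\sum_{I=1}^{N} \frac{n}{N} = \frac{n}{N}, \qquad (32)$$

whereas (31) results in 

$$V(\delta_{I_k}) = E(\pi_{I_k}^{(n)})\left[1 - E(\pi_{I_k}^{(n)})\right] = \frac{n}{N}\left(1 - \frac{n}{N}\right). \qquad (33)$$
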

4.3 The Covariance Between Sample Units 


In order to establish the covariance between different 
sample units we resort to a simple extension to (25), 


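$$\mathrm{Cov}(\delta_{I_k}, \delta_{I_l}) = \mathrm{Cov}_{I_k, I_l}\left[E(\delta_{I_k} \mid I_k), E(\delta_{I_l} \mid I_l)\right] + E_{I_k, I_l}\left[\mathrm{Cov}(\delta_{I_k}, \delta_{I_l} \mid I_k, I_l)\right]. \qquad (34)$$

In this case, we have that 

$$E(\delta_{I_k} \mid I_k = I) = \pi_I^{(n)} \qquad (35)$$

and 

$$E(\delta_{I_k} \delta_{I_l} \mid I_k = I, I_l = J) = \pi_{IJ}^{(n)} \qquad (36)$$

while the covariance between brackets on the right-hand 
side of (34) is easily seen to equal 

$$\mathrm{Cov}(\delta_{I_k}, \delta_{I_l} \mid I_k = I, I_l = J) = \pi_{IJ}^{(n)} - \pi_I^{(n)} \pi_J^{(n)}. \qquad (37)$$

From (35) and (36), we obtain 

$$\mathrm{Cov}_{I_k, I_l}\left[E(\delta_{I_k} \mid I_k), E(\delta_{I_l} \mid I_l)\right] = E_{I_k, I_l}(\pi_{I_k}^{(n)} \pi_{I_l}^{(n)}) - E_{I_k}(\pi_{I_k}^{(n)})\, E_{I_l}(\pi_{I_l}^{(n)}) \qquad (38)$$

whereas from (37) we get 

$$E_{I_k, I_l}\left[\mathrm{Cov}(\delta_{I_k}, \delta_{I_l} \mid I_k, I_l)\right] = E_{I_k, I_l}(\pi_{I_k I_l}^{(n)}) - E_{I_k, I_l}(\pi_{I_k}^{(n)} \pi_{I_l}^{(n)}). \qquad (39)$$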


Finally, adding these last two expressions we arrive at the 
desired covariance 


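$$\mathrm{Cov}(\delta_{I_k}, \delta_{I_l}) = E_{I_k, I_l}(\pi_{I_k I_l}^{(n)}) - E_{I_k}(\pi_{I_k}^{(n)})\, E_{I_l}(\pi_{I_l}^{(n)}) = \frac{1}{N(N-1)}\sum_{I=1}^{N}\sum_{J \ne I} \pi_{IJ}^{(n)} - \left[\frac{1}{N}\sum_{I=1}^{N} \pi_I^{(n)}\right]^2. \qquad (40)$$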

In the SRS/WR (40) results in 


$$\mathrm{Cov}(\delta_{I_k}, \delta_{I_l}) = -\frac{n}{N^2} \qquad (41)$$


while for the WOR case the covariance can be seen to equal 


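$$\mathrm{Cov}(\delta_{I_k}, \delta_{I_l}) = \frac{1}{N(N-1)}\sum_{I=1}^{N}\sum_{J \ne I} \frac{n(n-1)}{N(N-1)} - \left[\frac{1}{N}\sum_{I=1}^{N} \frac{n}{N}\right]^2 = \frac{n(n-1)}{N(N-1)} - \frac{n^2}{N^2} = -\frac{n(N-n)}{N^2(N-1)}. \qquad (42)$$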


It should be stressed that for SRS, regardless of whether 
it takes place with or without replacement, the correlation 
coefficients are given by 

$$\mathrm{Corr}(\delta_{I_k}, \delta_{I_l}) = -\frac{1}{N-1}, \qquad (43)$$

independently of the sample size. 

Furthermore, we have that, as the value of n approaches 
that of $N$ in WOR sampling, both $\pi_I^{(n)}$ and $\pi_{IJ}^{(n)}$ approach 
one. In particular, when $n = N$, the values of expressions 
(31) and (40) become zero. 
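
As a check on (43), the SRS variances and covariances given above can be divided exactly. The sketch below is illustrative only (the function name is hypothetical); it uses the SRS forms of (28) and (41) for WR and of (33) and (42) for WOR, and requires n < N so that the WOR variance is not zero.

```python
from fractions import Fraction


def srs_sample_unit_correlation(N, n, with_replacement):
    """Correlation between two sample units under SRS, for n < N."""
    if with_replacement:
        var = n * Fraction(1, N) * (1 - Fraction(1, N))             # SRS case of (28)
        cov = Fraction(n * (n - 1), N * N) - Fraction(n, N) ** 2    # (41)
    else:
        var = Fraction(n, N) * (1 - Fraction(n, N))                 # (33)
        cov = Fraction(n * (n - 1), N * (N - 1)) - Fraction(n, N) ** 2  # (42)
    return cov / var


correlations = {(n, wr): srs_sample_unit_correlation(10, n, wr)
                for n in (2, 5, 9) for wr in (True, False)}
```

With N = 10 every entry equals -1/9, whatever the sample size and whether or not units are replaced.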


5. The Correlation Matrix for Sample Units 


Once we realize that none of the expressions in (28), (31) 
and (40) depend on any of the arbitrary indices used to 
differentiate population units, it should become clear that the 
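$r \times r$ correlation matrix for the random vector 
$\delta = (\delta_{I_1}, \delta_{I_2}, \delta_{I_3}, \ldots, \delta_{I_r})$, where $r \le n$, may be written as: 

$$\mathrm{Corr}(\delta) = R_r(\rho) = \begin{pmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \cdots & \rho \\ \rho & \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho & \cdots & 1 \end{pmatrix} \qquad (44)$$

It should be noted that the elements of $R_r(\rho)$ in (44) 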
depend only on the inclusion probabilities which, for any 
sample size, may be fully computed from recursion (7), and 
expressions (8) and (10). In other words, they do not depend 
on any unknown population parameters to be estimated nor 
on the values of the variables to be measured on the sample 
units. 


6. Final Remarks 


In theory, the efficiency of every estimation procedure 
will experience some gain whenever explicit allowance for 
the correlation between sample units is made. This would 
certainly be the case for linear as well as for some instances 
of maximum-likelihood estimation. 

On the other hand, it should be emphasized that $R_n(\rho)$ 
may become singular as the sample size $n$ approaches the 
population size $N$; this is the case for SRS ($R_N(-1/(N-1))$) 
as well as for WOR sampling in general. Therefore, 
numerically, many estimation procedures which rely on the inverse 
or the determinant of R, rather than on the correlation matrix 
itself, may also benefit from replacing the simplifying 
assumption of independence between observations by a 
more realistic one of correlated observations whenever 
sample sizes are large relative to population sizes. Instances 
where this can happen are given by some stages in multi- 
stage sampling (e.g., number of households in a block) and 
by large country-wide surveys. 
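
The approach to singularity can be illustrated with the standard closed-form determinant of an equicorrelation matrix, $(1 - \rho)^{r-1}\{1 + (r - 1)\rho\}$. This formula is a well-known linear algebra result brought in for illustration, not something derived in the paper, and the function name is hypothetical.

```python
from fractions import Fraction


def equicorrelation_det(r, rho):
    """Determinant of the r x r matrix with ones on the diagonal and rho elsewhere."""
    return (1 - rho) ** (r - 1) * (1 + (r - 1) * rho)


N = 20
rho = Fraction(-1, N - 1)  # SRS correlation between sample units, as in (43)
dets = {r: equicorrelation_det(r, rho) for r in (2, 10, 19, 20)}
```

The determinant stays positive for r < N and is exactly zero at r = N, which is the singular case mentioned above.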
Algorithms and R Codes for the Pseudo Empirical Likelihood 
Method in Survey Sampling 


Changbao Wu 


Abstract 


We present computational algorithms for the recently proposed pseudo empirical likelihood method for the analysis of 
complex survey data. Several key algorithms for computing the maximum pseudo empirical likelihood estimators and for 
constructing the pseudo empirical likelihood ratio confidence intervals are implemented using the popular statistical 
software R and S-PLUS. Major codes are written in the form of R/S-PLUS functions and therefore can directly be used for 
survey applications and/or simulation studies. 


Key Words: Confidence interval; Bi-section algorithm; Empirical likelihood; Newton-Raphson procedure; Stratified 
sampling; Unequal probability sampling. 


1. Introduction 


One of the major challenges in applying advanced and 
often sophisticated statistical methods for real world surveys 
is the computational implementation of the method. Prac- 
tical considerations often rule out the use of methods which 
are theoretically sound and attractive but are computa- 
tionally formidable. 

The empirical likelihood method first proposed by Owen 
(1988) is one of the major advances in statistics during the 
past fifteen years. In addition to its data driven and range 
respecting feature in estimation and testing, its non- 
parametric and discrete nature is particularly appealing for 
finite population problems. Indeed an early version of the 
method, the so-called scale-load estimators, was used in 
survey sampling by Hartley and Rao back in 1968. The 
more recent investigation of the method in survey sampling 
has resulted in a series of research papers and generated 
noticeable interest among survey statisticians to further 
explore the method. Wu and Rao (2004) contains a brief 
summary on the recent development of the pseudo empirical 
likelihood (PEL) method in survey sampling. 

Progress on algorithmic development for the PEL 
method has also been made. A modified Newton-Raphson 
procedure for computing the maximum PEL estimators 
under non-stratified sampling was proposed by Chen, Sitter 
and Wu (2002). The procedure was further modified by Wu 
(2004a) to handle stratified sampling designs. 

In this article we present computational algorithms for 
computing the maximum PEL estimators and for construc- 
ting the related PEL ratio confidence intervals for complex 
surveys under a unified framework, with particular interest 
in implementing those algorithms using R and S-PLUS. 
The software package R, a friendly programming 
environment compatible with the popular commercial 
statistical software S-PLUS, is attracting more and more 
users from the statistical community. What is advantageous 
about using R is that it is available free for research use and 
the package may be easily downloaded from the web. It is 
hoped that this article will bridge the current gap between 
theoretical developments and practical applications of the 
PEL method and will generate more research activities in 
this direction to make fully practical use of the PEL method 
a reality. 

The algorithm for computing the maximum PEL 
estimator under non-stratified sampling and some notes on 
its implementation in R/S-PLUS are presented in section 2. 
The algorithm of Wu (2004a) for stratified sampling is 
discussed in section 3. Construction of the PEL ratio 
confidence intervals involves profiling the pseudo empirical 
likelihood ratio statistic and is detailed in section 4. All R 
functions or sample codes are included in the Appendix. 
They can also be downloaded from the author’s personal 
homepage http://www.stats.uwaterloo.ca/~cbwu/paper.html. 
These functions and codes were tested in the simulation 
study reported in Wu and Rao (2004) and were observed to 
perform very well. 


2. Non-Stratified Sampling 


Consider a finite population consisting of N identifiable 
units. Associated with the $i$th unit are values of the study 
variable, $y_i$, and a vector of auxiliary variables, $x_i$. The 
vector of population means $\bar{X} = N^{-1}\sum_{i=1}^{N} x_i$ is known. Let 
$\{(y_i, x_i), i \in s\}$ be the sample data where $s$ is the set of 
units selected using a complex survey design. Let 
$\pi_i = P(i \in s)$ be the inclusion probabilities and $d_i = 1/\pi_i$ 
be the design weights. 
The pseudo empirical maximum likelihood estimator of 
the population mean $\bar{Y} = N^{-1}\sum_{i=1}^{N} y_i$ is computed as 
$\hat{\bar{Y}}_{PEL} = \sum_{i \in s} \hat{p}_i y_i$, where the weights $\hat{p}_i$ are obtained by 
maximizing the pseudo empirical log likelihood function 

$$l_{ns}(p) = n^* \sum_{i \in s} d_i^* \log(p_i) \qquad (2.1)$$

subject to the set of constraints 

$$0 < p_i < 1, \quad \sum_{i \in s} p_i = 1 \quad \text{and} \quad \sum_{i \in s} p_i x_i = \bar{X}. \qquad (2.2)$$

The original pseudo empirical likelihood function proposed 
by Chen and Sitter (1999) is $l(p) = \sum_{i \in s} d_i \log(p_i)$. The 
pseudo empirical likelihood function $l_{ns}(p)$ given by (2.1) 
was used by Wu and Rao (2004), where $d_i^* = d_i / \sum_{i \in s} d_i$ 
are the normalized design weights and $n^*$ is the effective 
sample size. The point estimator $\hat{\bar{Y}}_{PEL} = \sum_{i \in s} \hat{p}_i y_i$ remains the 
same for either version of the likelihood function. The 
rescaling used in $l_{ns}(p)$ facilitates the construction of the 
PEL ratio confidence intervals. 

Using a standard Lagrange multiplier argument it can be 
shown that 

$$\hat{p}_i = \frac{d_i^*}{1 + \lambda'(x_i - \bar{X})} \quad \text{for } i \in s,$$

where the vector-valued Lagrange multiplier, $\lambda$, is the 
solution to 

$$g_1(\lambda) = \sum_{i \in s} \frac{d_i^* (x_i - \bar{X})}{1 + \lambda'(x_i - \bar{X})} = 0.$$

The major computational task here is to find the solution to 
$g_1(\lambda) = 0$. This can be done using the modified 
Newton-Raphson procedure proposed by Chen et al. (2002). The 
modification involves checking at each updating stage that 
the constraint $1 + \lambda'(x_i - \bar{X}) > 0$ (i.e., $\hat{p}_i > 0$) is always 
satisfied. Without loss of generality, we assume $\bar{X} = 0$ (if 
not, replace $x_i$ by $x_i - \bar{X}$ throughout). The modified 
procedure is as follows. 


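Step 0: Let $\lambda_0 = 0$. Set $k = 0$, $\gamma_0 = 1$ and $\varepsilon = 10^{-8}$. 

Step 1: Calculate $\Delta_1(\lambda_k)$ and $\Delta_2(\lambda_k)$ where 

$$\Delta_1(\lambda) = \sum_{i \in s} d_i^* \frac{x_i}{1 + \lambda' x_i}$$

and 

$$\Delta_2(\lambda) = \left\{-\sum_{i \in s} d_i^* \frac{x_i x_i'}{(1 + \lambda' x_i)^2}\right\}^{-1} \Delta_1(\lambda).$$

If $\|\Delta_2(\lambda_k)\| < \varepsilon$, stop the algorithm and report $\lambda_k$; 
otherwise go to Step 2. 

Step 2: Calculate $\delta_k = \gamma_k \Delta_2(\lambda_k)$. If $1 + (\lambda_k - \delta_k)' x_i \le 0$ 
for some $i$, let $\gamma_k = \gamma_k / 2$ and repeat Step 2. 

Step 3: Set $\lambda_{k+1} = \lambda_k - \delta_k$, $k = k + 1$ and $\gamma_{k+1} = (k + 1)^{-1/2}$. 
Go to Step 1. 
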
In the original algorithm presented by Chen et al. (2002), 
their step 2 also checks a related dual objective function. 
While this is necessary for the theoretical proof of 
convergence of the algorithm, it is not really required for 
practical applications. 
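
Steps 0 to 3 can be sketched as follows. This is a Python illustration written for this text, not the author's R function Lag2: the names `solve_linear` and `lagrange_multiplier` are hypothetical, the maximum norm is assumed for the stopping rule, the step length is taken as $\gamma_k = (k + 1)^{-1/2}$ at iteration $k$, and the auxiliary vectors are assumed to be already centred at the known means.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for r in reversed(range(m)):
        z[r] = (M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))) / M[r][r]
    return z


def lagrange_multiplier(x, d_star, eps=1e-8, max_iter=1000):
    """Modified Newton-Raphson for sum_i d*_i x_i / (1 + lam'x_i) = 0.

    x      -- list of m-vectors, already centred at the known means
    d_star -- normalized design weights (sum to one)
    """
    m = len(x[0])
    lam = [0.0] * m                                    # Step 0
    for k in range(max_iter):
        gamma = (k + 1) ** -0.5                        # gamma_0 = 1
        denom = [1.0 + sum(l * v for l, v in zip(lam, xi)) for xi in x]
        delta1 = [sum(d * xi[a] / t for d, xi, t in zip(d_star, x, denom))
                  for a in range(m)]                   # Step 1
        jac = [[-sum(d * xi[a] * xi[b] / t ** 2
                     for d, xi, t in zip(d_star, x, denom))
                for b in range(m)] for a in range(m)]
        delta2 = solve_linear(jac, delta1)
        if max(abs(v) for v in delta2) < eps:
            return lam
        while True:                                    # Step 2
            trial = [l - gamma * v for l, v in zip(lam, delta2)]
            if all(1.0 + sum(l * v for l, v in zip(trial, xi)) > 0 for xi in x):
                break
            gamma /= 2.0
        lam = trial                                    # Step 3
    raise RuntimeError("no convergence within max_iter iterations")


x = [(-1.0, 0.5), (0.5, -1.0), (1.0, 1.0), (-0.5, -0.2), (0.3, 0.4)]
d_star = [0.2] * 5
lam = lagrange_multiplier(x, d_star)
p_hat = [d / (1.0 + sum(l * v for l, v in zip(lam, xi)))
         for d, xi in zip(d_star, x)]
```

The resulting weights are positive, add to one, and reproduce the (zero) benchmark means. Because the step length shrinks with $k$, the sketch allows a generous iteration limit.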

The R function Lag2(u,ds,mu) can be used for finding 
the solution to $g_1(\lambda) = 0$ when the vector of auxiliary 
variables $x$ is of dimension $m$ and $m \ge 2$. When $x$ is 
univariate, an extremely simple and stable bi-section 
method to be described shortly should be used. Let $n$ be the 
sample size. The three required arguments are the $n \times m$ 
data matrix u, the $n \times 1$ vector of design weights ds and the 
$m \times 1$ population mean vector mu. The 
function Lag2(u,ds,mu) returns the value of $\lambda$ which is the 
solution to $g_1(\lambda) = 0$. 

The function Lag2(u,ds,mu) will fail to provide a 
solution if (i) the mean vector $\bar{X}$ is not an inner point of the 
convex hull formed by $\{x_i, i \in s\}$, or (ii) the matrix 
$\sum_{i \in s} d_i^* x_i x_i'$ is not of full rank. In case (i) the pseudo 
empirical maximum likelihood estimator does not exist. 
This happens with probability approaching zero as the 
sample size $n$ goes to infinity. In case (ii) one may consider 
removing some components of the $x$ variables from the set 
of constraints (2.2) to eliminate the collinearity problem. 

When the $x$ variable is univariate, so is the involved 
Lagrange multiplier $\lambda$. In this case we need to solve 
$g_1(\lambda) = \sum_{i \in s} d_i^* x_i / (1 + \lambda x_i) = 0$ for a scalar $\lambda$, 
assuming $\bar{X} = 0$. A unique solution exists if and only if 
$\min\{x_i, i \in s\} < 0 < \max\{x_i, i \in s\}$. The solution, if it 
exists, lies between $L = -1/\max\{x_i, i \in s\}$ and 
$U = -1/\min\{x_i, i \in s\}$. Noting that $g_1(\lambda)$ is a monotone 
decreasing function for $\lambda \in (L, U)$, the most efficient and 
reliable algorithm for solving $g_1(\lambda) = 0$ is the bi-section 
method. The function Lag1(u,ds,mu) does exactly this, 
where the required arguments are u $= (x_1, \ldots, x_n)$, ds $= (d_1, \ldots, d_n)$ 
and mu $= \bar{X}$. The output returns the solution 
to $g_1(\lambda) = 0$. 
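
The bi-section search for the scalar case can be sketched as below. This is an illustrative Python version written for this text, not the R function Lag1; the function name is hypothetical, the values are assumed to be centred at the known mean, and a fixed number of halvings is used instead of a tolerance.

```python
def lagrange_multiplier_1d(x, d_star, iters=100):
    """Bi-section for g1(lam) = sum_i d*_i x_i / (1 + lam * x_i) = 0.

    x      -- scalar auxiliary values, already centred at the known mean
    d_star -- normalized design weights (sum to one)
    """
    if not (min(x) < 0 < max(x)):
        raise ValueError("0 must lie strictly inside the range of x")
    lo, hi = -1.0 / max(x), -1.0 / min(x)   # the interval (L, U)

    def g1(lam):
        return sum(d * xi / (1.0 + lam * xi) for d, xi in zip(d_star, x))

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g1(mid) > 0:      # g1 is decreasing, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


x1 = [-2.0, -1.0, 0.5, 1.0, 3.0]
w1 = [0.2] * 5
lam1 = lagrange_multiplier_1d(x1, w1)
p1 = [d / (1.0 + lam1 * xi) for d, xi in zip(w1, x1)]
```

Only interior points of (L, U) are ever evaluated, so the denominators stay positive throughout the search.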


The function Lag1(u,ds,mu) can be used in conjunction 
with the model-calibrated pseudo empirical likelihood 
(MCPEL) approach of Wu and Sitter (2001) to handle cases 
where the x variable is high dimensional. The MCPEL 
approach involves only a single dimension reduction 
variable derived from a multiple linear regression model and 
the related Lagrange multiplier problem is always of 
dimension one. 


3. Stratified Sampling 


Let $\{(y_{hi}, x_{hi}), i \in s_h, h = 1, \ldots, H\}$ be the sample 
data from a stratified sampling design. Let 
$d_{hi}^* = d_{hi} / \sum_{i \in s_h} d_{hi}$ be the normalized design weights for stratum 
$h$, $h = 1, \ldots, H$. The pseudo empirical likelihood function 
under stratified sampling defined by Wu and Rao (2004) is 
given by 

$$l_{st}(p_1, \ldots, p_H) = n^* \sum_{h=1}^{H} W_h \sum_{i \in s_h} d_{hi}^* \log(p_{hi}), \qquad (3.1)$$

where $W_h = N_h/N$ are the stratum weights and $n^*$ is the 
total effective sample size as defined in Wu and Rao (2004). 
The value of $n^*$ is not required for point estimation but this 
scaling constant is needed for the construction of confidence 
intervals. Let $\bar{X}$ be the known vector of population means 
for auxiliary variables. The maximum pseudo empirical 
likelihood estimator of the population mean 
$\bar{Y} = \sum_{h=1}^{H} W_h \bar{Y}_h$ is defined as $\hat{\bar{Y}}_{PEL} = \sum_{h=1}^{H} W_h \sum_{i \in s_h} \hat{p}_{hi} y_{hi}$, 
where the $\hat{p}_{hi}$ maximize $l_{st}(p_1, \ldots, p_H)$ subject to the set 
of constraints 

$$p_{hi} > 0, \quad \sum_{i \in s_h} p_{hi} = 1, \quad h = 1, \ldots, H, \quad \text{and} \quad \sum_{h=1}^{H} W_h \sum_{i \in s_h} p_{hi} x_{hi} = \bar{X}. \qquad (3.2)$$


The major computational difficulty under stratified 
sampling is caused by the fact that the subnormalization of 
weights (i.e., $\sum_{i \in s_h} p_{hi} = 1$) occurs at the stratum level while 
the benchmark constraints (i.e., $\sum_{h=1}^{H} W_h \sum_{i \in s_h} p_{hi} x_{hi} = \bar{X}$) 
and the constrained maximization of the PEL function are 
taken at the population level. The algorithm proposed 
by Wu (2004a) for computing the $\hat{p}_{hi}$ proceeds as follows: 
let $x_{hi}$ be augmented to include the first $H - 1$ stratum 
indicator variables and $\bar{X}$ be augmented to include 
$(W_1, \ldots, W_{H-1})$ as its first $H - 1$ components. In the case of 
no benchmark constraints involved, the augmented $x$ 
variable will consist of the $H - 1$ stratum indicator 
variables only and $\bar{X} = (W_1, \ldots, W_{H-1})$. It follows that the 
set of constraints (3.2) is equivalent to 

$$p_{hi} > 0, \quad \sum_{h=1}^{H} W_h \sum_{i \in s_h} p_{hi} = 1 \quad \text{and} \quad \sum_{h=1}^{H} W_h \sum_{i \in s_h} p_{hi} x_{hi} = \bar{X}, \qquad (3.3)$$


where the $x$ variable is now augmented. Let $u_{hi} = x_{hi} - \bar{X}$. 
It is straightforward by using a standard Lagrange 
multiplier argument to show that 

$$\hat{p}_{hi} = \frac{d_{hi}^*}{1 + \lambda' u_{hi}},$$

with the vector-valued $\lambda$ being the solution to 

$$g_2(\lambda) = \sum_{h=1}^{H} W_h \sum_{i \in s_h} \frac{d_{hi}^* u_{hi}}{1 + \lambda' u_{hi}} = 0.$$

The modified Newton-Raphson procedure of section 2 for 
solving $g_1(\lambda) = 0$ can be used for solving $g_2(\lambda) = 0$. The 
key computational step under stratified sampling designs is 
to put the data file into a suitable format so that the R 
function Lag2(u,ds,mu) for non-stratified sampling can 
directly be called. Sample R codes for doing this are 
included in the Appendix. 
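
The data preparation step can be sketched as follows. This Python illustration is not the Appendix code: the function name `prepare_stratified` and its argument layout are hypothetical. It stacks the strata, prepends the first $H - 1$ stratum indicators to each auxiliary vector, and combines the weights as $W_h d_{hi}^*$, which is the weight attached to each term of $g_2(\lambda)$.

```python
def prepare_stratified(x_by_stratum, d_by_stratum, W, x_bar):
    """Stack stratified data into (u, ds, mu) for a non-stratified solver.

    x_by_stratum[h] -- list of auxiliary vectors x_hi (may be empty lists)
    d_by_stratum[h] -- design weights d_hi
    W[h]            -- stratum weights N_h / N
    x_bar           -- known population means of the auxiliary variables
    """
    H = len(W)
    u, ds = [], []
    for h in range(H):
        total = sum(d_by_stratum[h])
        for x_hi, d_hi in zip(x_by_stratum[h], d_by_stratum[h]):
            indicators = [1.0 if h == g else 0.0 for g in range(H - 1)]
            u.append(indicators + list(x_hi))
            ds.append(W[h] * d_hi / total)     # W_h times normalized weight
    mu = list(W[:H - 1]) + list(x_bar)
    return u, ds, mu


x_by_stratum = [[[1.0], [2.0]], [[3.0], [4.0], [5.0]]]
d_by_stratum = [[10.0, 30.0], [5.0, 5.0, 10.0]]
W = [0.4, 0.6]
u, ds, mu = prepare_stratified(x_by_stratum, d_by_stratum, W, x_bar=[2.5])
```

Under this reading, a non-stratified solver applied to (u, ds, mu) returns weights equal to $W_h \hat{p}_{hi}$, so dividing by $W_h$ recovers the $\hat{p}_{hi}$.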


4. Construction of PEL Ratio 
Confidence Intervals 


While the computational algorithms for the maximum 
PEL estimator under non-stratified and stratified sampling 
designs are somewhat different, the search for the lower and 
the upper boundary of the pseudo empirical likelihood ratio 
confidence interval for Y involves the same type of profile 
analysis. Under non-stratified sampling designs, the 
$(1 - \alpha)$-level PEL ratio confidence interval of $\bar{Y}$ is 
constructed as 

$$\{\theta \mid r_{ns}(\theta) < \chi_1^2(\alpha)\}, \qquad (4.1)$$

where $\chi_1^2(\alpha)$ is the $1 - \alpha$ quantile from a $\chi^2$ distribution 
with one degree of freedom. The pseudo empirical log 
likelihood ratio statistic $r_{ns}(\theta)$ is computed as 

$$r_{ns}(\theta) = -2\{l_{ns}(\tilde{p}) - l_{ns}(\hat{p})\},$$

where the $\hat{p}$ maximize $l_{ns}(p)$ subject to the set of 
“standard constraints” such as (2.2) and the $\tilde{p}$ maximize 
$l_{ns}(p)$ subject to the “standard constraints” plus an 
additional one induced by the parameter of interest, $\bar{Y}$, 

$$\sum_{i \in s} p_i y_i = \theta. \qquad (4.2)$$

To compute $\tilde{p}$ one needs to treat (4.2) as an additional 
component of the “standard constraints” for each fixed 
value of $\theta$ so that the maximization process is essentially the 
same as before. 

Let $(\hat{L}, \hat{U})$ be the interval given by (4.1). Our proposed
bi-section method in searching for $\hat{L}$ and $\hat{U}$ is based on
the following observations:


i) The minimum value of $r_{ns}(\theta)$ is achieved at $\theta = \sum_{i \in s} \hat{p}_i y_i = \hat{\bar{Y}}_{PEL}$. In this case $\tilde{p} = \hat{p}$ and $r_{ns}(\theta) = 0$.

ii) The interval $(\hat{L}, \hat{U})$ is bounded by $(y_{(1)}, y_{(n)})$ where
$y_{(1)} = \min\{y_i, i \in s\}$ and $y_{(n)} = \max\{y_i, i \in s\}$.

iii) The pseudo empirical likelihood ratio function
$r_{ns}(\theta)$ is monotone decreasing for $\theta \in (y_{(1)}, \hat{\bar{Y}}_{PEL})$
and monotone increasing for $\theta \in (\hat{\bar{Y}}_{PEL}, y_{(n)})$.


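Conclusion iii) can be reached by noting that $l_{ns}(\hat{p})$ does
not involve $\theta$ and $l_{ns}(\tilde{p}) = n^* \sum_{i \in s} d_i^* \log(\tilde{p}_i)$ is typically a
concave function of $\theta$. It is also possible to show this by
directly checking $dr_{ns}(\theta)/d\theta$. For instance, in the case of
no auxiliary information involved, the “standard constraints” are $p_i > 0$ and $\sum_{i \in s} p_i = 1$. The $\hat{p}_i$ are given by
$d_i^*$ and $\hat{\bar{Y}}_{PEL} = \sum_{i \in s} d_i^* y_i$. The $\tilde{p}_i$ are computed as

$$\tilde{p}_i = \frac{d_i^*}{1 + \lambda (y_i - \theta)}, \qquad (4.3)$$

where the $\lambda$ is the solution to

$$\sum_{i \in s} \frac{d_i^* (y_i - \theta)}{1 + \lambda (y_i - \theta)} = 0. \qquad (4.4)$$

Using (4.3) and (4.4), and noting that $\sum_{i \in s} d_i^* / \{1 + \lambda (y_i - \theta)\} = 1$, it is straightforward to show that

$$\frac{d r_{ns}(\theta)}{d\theta} = 2 n^* \sum_{i \in s} \frac{d_i^* \{(d\lambda/d\theta)(y_i - \theta) - \lambda\}}{1 + \lambda (y_i - \theta)} = -2 n^* \lambda.$$

By re-writing $d_i^* (y_i - \theta)$ as $d_i^* (y_i - \theta) [\{1 + \lambda (y_i - \theta)\} - \lambda (y_i - \theta)]$ and after some re-grouping in (4.4) we get

$$\lambda \sum_{i \in s} \frac{d_i^* (y_i - \theta)^2}{1 + \lambda (y_i - \theta)} = \sum_{i \in s} d_i^* y_i - \theta.$$

It follows that $dr_{ns}(\theta)/d\theta = -2 n^* \lambda < 0$
if $\theta < \sum_{i \in s} d_i^* y_i = \hat{\bar{Y}}_{PEL}$, and $dr_{ns}(\theta)/d\theta > 0$ otherwise.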
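
As an informal numerical check of the derivative identity above, the following R sketch compares a central finite difference of $r_{ns}(\theta)$ with $-2 n^* \lambda$ in the case of no auxiliary information. The simulated data, the value nss = 20 and the helper names lam_fun and r_fun are assumptions made for illustration only, and base R's uniroot() is used here in place of the bi-section code of the Appendix.

```r
# Numerical check of dr(theta)/dtheta = -2 * n^* * lambda (no auxiliary information).
# Data, weights, nss and the function names are illustrative assumptions.
set.seed(1)
ys  <- rexp(30, rate = 0.1)                 # hypothetical sample values y_i
ds  <- runif(30, 1, 3)
ds  <- ds/sum(ds)                           # normalized design weights d_i^*
nss <- 20                                   # hypothetical effective sample size n^*

# lambda solves (4.4); uniroot() searches strictly inside (-1/max(u), -1/min(u))
lam_fun <- function(theta) {
  u <- ys - theta
  g <- function(lam) sum(ds*u/(1 + lam*u))
  shrink <- 1 - 1e-08
  uniroot(g, lower = -shrink/max(u), upper = -shrink/min(u), tol = 1e-12)$root
}

# PEL ratio statistic r(theta) = 2 n^* sum d_i^* log{1 + lambda (y_i - theta)}
r_fun <- function(theta) {
  lam <- lam_fun(theta)
  2*nss*sum(ds*log(1 + lam*(ys - theta)))
}

yhat  <- sum(ds*ys)                         # Hajek estimate, where r(theta) = 0
theta <- (min(ys) + yhat)/2                 # a point below the Hajek estimate
h     <- 1e-05
num_deriv <- (r_fun(theta + h) - r_fun(theta - h))/(2*h)   # central difference
ana_deriv <- -2*nss*lam_fun(theta)                          # closed form
print(c(numerical = num_deriv, analytical = ana_deriv))
```

Because the chosen $\theta$ lies below the Hájek estimate, both values should be negative and agree up to the accuracy of the finite difference.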

Sample codes for finding $(\hat{L}, \hat{U})$ where no auxiliary
variable is involved are included in the Appendix. In this
case $\hat{p}_i = d_i^*$ and $\hat{\bar{Y}}_{PEL} = \sum_{i \in s} d_i^* y_i = \hat{\bar{Y}}_H$ is the Hájek
estimator for $\bar{Y}$. The profiling process involves finding $\lambda$
for each chosen value of $\theta$ and evaluating the PEL ratio
statistic $r_{ns}(\theta)$ against the cut-off value from the $\chi_1^2$
distribution under the desired confidence level $1 - \alpha$. With
auxiliary information, one needs to modify the computation
of $r_{ns}(\theta)$ for each fixed $\theta$. The bi-section search algorithm
for finding $\hat{L}$ and $\hat{U}$ remains the same.

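The value of the effective sample size $n^*$ is required for
computing the PEL ratio statistic $r_{ns}(\theta)$. For non-stratified
sampling designs it is computed as $n^* = S_y^2 / V(\bar{y})$ where

$$S_y^2 = \frac{1}{N(N-1)} \sum_{i \in s} \sum_{j > i} \frac{(y_i - y_j)^2}{\pi_{ij}}$$

and

$$V(\bar{y}) = \frac{1}{N^2} \sum_{i \in s} \sum_{j > i} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{e_i}{\pi_i} - \frac{e_j}{\pi_j} \right)^2,$$

where $e_i = y_i - \hat{\mu}$ and $\hat{\mu} = N^{-1} \sum_{i \in s} d_i y_i$. See Wu and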
Rao (2004) for further detail. Computation of $n^*$ involves
the second order inclusion probabilities $\pi_{ij}$, which may
pose a real challenge if a πps sampling scheme is used.
In the simulation study reported in Wu and Rao (2004), the
Rao-Sampford πps sampling method was used. R
functions for selecting a πps sample using this method as
well as for computing the related second order inclusion
probabilities can be found in Wu (2004b). Similar R
functions are also available in an add-on R package called
“pps”, written by J. Gambino (2003), which can be
downloaded from the R homepage http://cran.r-project.org/
by clicking the packages option.
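
For a design in which the $\pi_{ij}$ are available in closed form, the computation of $n^*$ can be sketched in a few lines of R. The function name eff_size and its arguments are hypothetical, and the pairwise (Yates-Grundy type) expressions written in the comments are an assumed reading of the formulas for $S_y^2$ and $V(\bar{y})$; Wu and Rao (2004) should be consulted for the definitive forms. The example uses simple random sampling without replacement, for which these expressions reduce to $n^* = n/(1 - n/N)$.

```r
# Sketch: effective sample size n^* = S_y^2 / V(ybar), ASSUMING the pairwise forms
#   S_y^2   = {N(N-1)}^{-1} sum_{i<j} (y_i - y_j)^2 / pi_ij
#   V(ybar) = N^{-2} sum_{i<j} {(pi_i pi_j - pi_ij)/pi_ij} (e_i/pi_i - e_j/pi_j)^2
# eff_size and its argument names are hypothetical.
eff_size <- function(y, pi1, pi2, N) {
  e  <- y - sum(y/pi1)/N                    # e_i = y_i - mu-hat
  up <- upper.tri(pi2)                      # selects the pairs with i < j
  dy <- outer(y, y, "-")                    # y_i - y_j
  dz <- outer(e/pi1, e/pi1, "-")            # e_i/pi_i - e_j/pi_j
  pp <- outer(pi1, pi1)                     # pi_i * pi_j
  S2 <- sum((dy^2/pi2)[up])/(N*(N - 1))
  V  <- sum((((pp - pi2)/pi2)*dz^2)[up])/N^2
  S2/V
}

# Simple random sampling without replacement:
# pi_i = n/N and pi_ij = n(n-1)/{N(N-1)} for i != j
set.seed(2)
N   <- 200
n   <- 25
y   <- rnorm(n, 50, 8)                      # hypothetical sample values
pi1 <- rep(n/N, n)
pi2 <- matrix(n*(n - 1)/(N*(N - 1)), n, n)
diag(pi2) <- pi1                            # diagonal is not used by eff_size
nss <- eff_size(y, pi1, pi2, N)
print(c(nss = nss, srs_value = n/(1 - n/N)))
```

For unequal probability designs the vector pi1 and the matrix pi2 would have to be supplied by the sampling routine that was used to select the sample.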
Appendix: R/S-PLUS Codes

A1. R Function for solving $g_1(\lambda) = 0$.


Let m be the number of auxiliary variables involved and
$m \geq 2$. There are three required arguments in the function
Lag2(u,ds,mu):


(1) u: the $n \times m$ data matrix with $x_i$ as its $i$th row, $i = 1, \ldots, n$.


(2) ds: the $n \times 1$ vector of design weights consisting of $d_1^*, \ldots, d_n^*$.


(3) mu: the $m \times 1$ population mean vector $\bar{X}$.


The output of the function is the solution to $g_1(\lambda) = 0$.


Lag2<-function(u,ds,mu)
{
   n<-length(ds)
   u<-u-rep(1,n)%*%t(mu)              # centre the data: u_i = x_i - mu
   M<-0*mu                            # starting value lambda = 0
   dif<-1
   tol<-1e-08
   while(dif>tol){
      D1<-0*mu
      DD<-D1%*%t(D1)
      for(i in 1:n){
         aa<-as.numeric(1+t(M)%*%u[i,])
         D1<-D1+ds[i]*u[i,]/aa                  # accumulates g_1(lambda)
         DD<-DD-ds[i]*(u[i,]%*%t(u[i,]))/aa^2   # derivative matrix of g_1
      }
      D2<-solve(DD,D1,tol=1e-12)      # Newton-Raphson step
      dif<-max(abs(D2))
      rule<-1
      while(rule>0){                  # halve the step until all 1+lambda'u_i > 0
         rule<-0
         if(min(1+t(M-D2)%*%t(u))<=0) rule<-rule+1
         if(rule>0) D2<-D2/2
      }
      M<-M-D2
   }
   return(M)
}
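
The same iteration can also be written without the loop over sample units. The following is a minimal sketch of such a vectorized variant together with a small usage example; the name lag2_vec, the maxit safeguard, the simulated data and the benchmark means Xbar are assumptions made for illustration and are not part of the original code. The stopping rule is checked after the step-halving rather than before it.

```r
# Sketch of a vectorized variant of the modified Newton-Raphson iteration.
# lag2_vec, maxit and the simulated data are illustrative assumptions.
lag2_vec <- function(u, ds, mu, tol = 1e-08, maxit = 100) {
  u <- sweep(as.matrix(u), 2, mu)               # u_i = x_i - mu, row by row
  M <- rep(0, ncol(u))                          # starting value lambda = 0
  for (it in seq_len(maxit)) {
    aa <- as.vector(1 + u %*% M)                # 1 + lambda'u_i
    D1 <- colSums(ds*u/aa)                      # g_1(lambda)
    DD <- -crossprod(u*sqrt(ds)/aa)             # derivative matrix of g_1
    D2 <- solve(DD, D1)                         # Newton-Raphson step
    while (min(1 + u %*% (M - D2)) <= 0) D2 <- D2/2   # keep all 1 + lambda'u_i > 0
    M <- M - D2
    if (max(abs(D2)) < tol) break
  }
  M
}

# Usage with two hypothetical auxiliary variables
set.seed(3)
n    <- 40
x    <- cbind(rnorm(n, 10, 2), runif(n, 0, 5))
ds   <- runif(n, 1, 4)
ds   <- ds/sum(ds)                              # normalized design weights
Xbar <- c(10.2, 2.4)                            # hypothetical known population means
lam  <- lag2_vec(x, ds, Xbar)
phat <- ds/as.vector(1 + sweep(x, 2, Xbar) %*% lam)   # resulting PEL weights
print(colSums(phat*x))                          # should reproduce Xbar
```

The weights phat sum to one and satisfy the benchmark constraints, which gives a quick way to confirm that the returned value solves $g_1(\lambda) = 0$.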
A2. R Function for solving $g_1(\lambda) = 0$.


When the x variable is univariate, the solution to
$g_1(\lambda) = 0$ can be found through a simple and reliable bi-section
method. The three required arguments for the
function Lag1(u,ds,mu) are u $= (x_1, \ldots, x_n)$, ds $= (d_1^*, \ldots, d_n^*)$
and mu $= \bar{X}$. The output is the solution to
$g_1(\lambda) = 0$.

Lag1<-function(u,ds,mu)
{
   L<--1/max(u-mu)                    # lower end of the admissible range
   R<--1/min(u-mu)                    # upper end of the admissible range
   dif<-1
   tol<-1e-08
   while(dif>tol){
      M<-(L+R)/2
      glam<-sum((ds*(u-mu))/(1+M*(u-mu)))
      if(glam>0) L<-M                 # g_1 is decreasing: root lies to the right
      if(glam<0) R<-M                 # root lies to the left
      dif<-abs(glam)
   }
   return(M)
}
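
The univariate solution can be cross-checked against a general-purpose root finder. The sketch below uses base R's uniroot() on the same equation; the name lag1_uniroot, the shrink factor that keeps the search strictly inside the admissible range, and the simulated data are assumptions made for illustration.

```r
# Cross-check sketch for the univariate case using base R's uniroot().
# lag1_uniroot, shrink and the simulated data are illustrative assumptions.
lag1_uniroot <- function(u, ds, mu) {
  v <- u - mu
  g <- function(lam) sum(ds*v/(1 + lam*v))      # g_1(lambda), decreasing in lambda
  shrink <- 1 - 1e-08                           # stay inside (-1/max(v), -1/min(v))
  uniroot(g, lower = -shrink/max(v), upper = -shrink/min(v), tol = 1e-12)$root
}

set.seed(4)
x    <- rgamma(50, shape = 4, rate = 0.5)       # hypothetical auxiliary variable
ds   <- rep(1/50, 50)                           # equal normalized design weights
Xbar <- 0.9*sum(ds*x) + 0.1*max(x)              # a benchmark mean inside the range of x
lam  <- lag1_uniroot(x, ds, Xbar)
phat <- ds/(1 + lam*(x - Xbar))                 # resulting PEL weights
print(c(sum_p = sum(phat), weighted_mean = sum(phat*x), Xbar = Xbar))
```

A solution exists only when the benchmark mean lies strictly between the smallest and the largest sampled x value, which is why Xbar is constructed as a convex combination here.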


A3. Sample code for stratified sampling. 


We need to call the function Lag2(u,ds,mu) from
nonstratified sampling. The key step is to prepare the data
file in a suitable format. Let


(1) n $= (n_1, \ldots, n_H)$ be the vector of stratum sample
sizes.


(2) x be the data matrix with $x_{hi}$ as row vectors,
$i = 1, \ldots, n_h$, $h = 1, \ldots, H$.


(3) ds $= (d_{11}^*, \ldots, d_{1n_1}^*, \ldots, d_{H1}^*, \ldots, d_{Hn_H}^*)$, where $d_{hi}^*$
are the normalized initial design weights for
stratum h.


(4) X be the vector of known population means.


(5) W $= (W_1, \ldots, W_H)$ be the vector of stratum
weights (i.e., $W_h = N_h/N$).


The following sample codes show how the solution to
$g_2(\lambda) = 0$ is found (M from the second last line of the
following code) and how the $\hat{p}_{hi}$'s are computed (phi from
the last line).


###
nst<-sum(n)                        # total sample size
k<-length(n)-1                     # number of stratum indicators, H-1
ntot<-cumsum(n)[1:k]               # last row index of strata 1,...,H-1
ist<-matrix(0,nst,k)               # stratum indicator variables
ist[1:n[1],1]<-1
if(k>1) for(j in 2:k) ist[(ntot[j-1]+1):ntot[j],j]<-1
uhi<-cbind(ist,x)                  # augmented x variable
mu<-c(W[1:k],X)                    # augmented vector of population means
whi<-rep(W,n)                      # stratum weight W_h repeated for each unit
dhi<-whi*ds
M<-Lag2(uhi,dhi,mu)
phi<-as.vector(ds/(1+(uhi-rep(1,nst)%*%t(mu))%*%M))
###
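
A usage sketch with simulated data may help to see what the data preparation achieves. Everything below is illustrative: the three strata, the simulated x values, the benchmark mean X and the compact solver newton_lam (a hypothetical stand-in for Lag2(u,ds,mu), included only so that the example runs on its own) are assumptions and not part of the original code.

```r
# Usage sketch for stratified sampling with H = 3 strata and one auxiliary variable.
# newton_lam is a compact, hypothetical stand-in for Lag2(u,ds,mu).
newton_lam <- function(u, ds, mu, tol = 1e-08) {
  u <- sweep(as.matrix(u), 2, mu)               # centre the augmented x variable
  M <- rep(0, ncol(u))
  dif <- 1
  while (dif > tol) {
    aa  <- as.vector(1 + u %*% M)
    D2  <- solve(-crossprod(u*sqrt(ds)/aa), colSums(ds*u/aa))   # Newton step
    dif <- max(abs(D2))
    while (min(1 + u %*% (M - D2)) <= 0) D2 <- D2/2             # step halving
    M <- M - D2
  }
  M
}

set.seed(5)
n     <- c(10, 15, 20)                          # stratum sample sizes
W     <- c(0.5, 0.3, 0.2)                       # stratum weights W_h = N_h/N
strat <- rep(seq_along(n), n)                   # stratum label of each sampled unit
x     <- rnorm(sum(n), mean = c(5, 8, 12)[strat])
ds    <- 1/n[strat]                             # d_hi^*, normalized within stratum
X     <- sum(W*tapply(x, strat, mean)) + 0.1    # hypothetical known population mean

k   <- length(n) - 1
ist <- outer(strat, seq_len(k), "==")*1         # first H-1 stratum indicators
uhi <- cbind(ist, x)                            # augmented x variable
mu  <- c(W[1:k], X)                             # augmented population means
dhi <- rep(W, n)*ds
M   <- newton_lam(uhi, dhi, mu)
phi <- as.vector(ds/(1 + sweep(uhi, 2, mu) %*% M))

print(tapply(phi, strat, sum))                  # each stratum should sum to 1
print(sum(rep(W, n)*phi*x))                     # should reproduce X
```

The two printed checks correspond to the stratum-level normalization and the population-level benchmark constraint, both of which are enforced through the augmented variable.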


A4. Sample code for finding the PEL ratio confidence 
interval. 


The search for the lower boundary (LB) and the upper 
boundary (UB) of the PEL ratio confidence interval needs to 
be carried out separately. The following codes show how 
this is done for the case of no auxiliary information. With
auxiliary information, one needs to modify the computation
of the involved pseudo empirical likelihood ratio statistic
(elratio) accordingly. Let


(1) a $= 1 - \alpha$ be the confidence level of the desired
interval.


(2) ys $= (y_1, \ldots, y_n)$ be the sample data.

(3) ds $= (d_1^*, \ldots, d_n^*)$ be the normalized design
weights.

(4) YEL $= \sum_{i \in s} \hat{p}_i y_i$ (in this case $\hat{p}_i = d_i^*$).

(5) nss be the estimated effective sample size $n^*$.


###
tol<-1e-08
cut<-qchisq(a,1)                   # cut-off value from the chi-square(1) distribution
###  upper boundary: search between YEL and max(ys)
t1<-YEL
t2<-max(ys)
dif<-t2-t1
while(dif>tol){
   tau<-(t1+t2)/2
   M<-Lag1(ys,ds,tau)
   elratio<-2*nss*sum(ds*log(1+M*(ys-tau)))
   if(elratio>cut) t2<-tau
   if(elratio<=cut) t1<-tau
   dif<-t2-t1
}
UB<-(t1+t2)/2
###  lower boundary: search between min(ys) and YEL
t1<-YEL
t2<-min(ys)
dif<-t1-t2
while(dif>tol){
   tau<-(t1+t2)/2
   M<-Lag1(ys,ds,tau)
   elratio<-2*nss*sum(ds*log(1+M*(ys-tau)))
   if(elratio>cut) t2<-tau
   if(elratio<=cut) t1<-tau
   dif<-t1-t2
}
LB<-(t1+t2)/2
###
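
The two searches can be collected into a single function. The sketch below is one possible arrangement for the case of no auxiliary information; the names pel_lam, pel_ratio and pel_ci are hypothetical, uniroot() replaces Lag1 so that the example is self-contained, and the data and the value of nss are simulated for illustration.

```r
# Sketch: PEL ratio confidence interval as one function (no auxiliary information).
# pel_lam, pel_ratio, pel_ci and the simulated data are illustrative assumptions.
pel_lam <- function(ys, ds, theta) {
  u <- ys - theta
  g <- function(lam) sum(ds*u/(1 + lam*u))
  shrink <- 1 - 1e-08                           # stay inside the admissible range
  uniroot(g, lower = -shrink/max(u), upper = -shrink/min(u), tol = 1e-12)$root
}

pel_ratio <- function(ys, ds, nss, theta) {
  M <- pel_lam(ys, ds, theta)
  2*nss*sum(ds*log(1 + M*(ys - theta)))         # same quantity as elratio
}

pel_ci <- function(ys, ds, nss, a = 0.95, tol = 1e-08) {
  cut <- qchisq(a, 1)
  YEL <- sum(ds*ys)
  # bi-section between a point inside the interval and a point outside it
  bisect <- function(inside, outside) {
    while (abs(outside - inside) > tol) {
      tau <- (inside + outside)/2
      if (pel_ratio(ys, ds, nss, tau) > cut) outside <- tau else inside <- tau
    }
    (inside + outside)/2
  }
  c(LB = bisect(YEL, min(ys)), UB = bisect(YEL, max(ys)))
}

set.seed(6)
ys  <- rlnorm(60, meanlog = 3, sdlog = 0.5)     # hypothetical sample data
ds  <- runif(60, 1, 2)
ds  <- ds/sum(ds)                               # normalized design weights
nss <- 60                                       # hypothetical effective sample size
ci  <- pel_ci(ys, ds, nss)
print(ci)
```

At the returned boundaries the PEL ratio statistic should equal the chi-square cut-off up to the tolerance of the search, which provides a simple check of the result.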
GUIDELINES FOR MANUSCRIPTS


Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 19, No. 1 and 
onward) as a guide and note particularly the points below. Articles must be submitted in machine-readable form, preferably 
in Word. A paper copy may be required for formulas and figures. 


Layout 


Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double
spaced with margins of at least 1½ inches on all sides.

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as “exp(·)”
and “log(·)”, etc.

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are 
to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1).

Italics are used for emphasis. Indicate italics by underlining on the manuscript. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self 
explanatory as possible, at the bottom for figures and at the top for tables. 

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they 
should appear near where they are first referred to). 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, page 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues.